
Parallel Programming

Sathish S. Vadhiyar
Course Web Page:
http://www.serc.iisc.ernet.in/~vss/courses/PPP2007
Outline
 Motivation for parallel programming
 Challenges in parallel programming
 Evaluating a parallel program/algorithm – speedup, efficiency, scalability analysis
 Parallel Algorithm – Design, Types and Models
 Parallel Architectures
Motivation for Parallel Programming
• Faster execution time, by exploiting the independence between regions of code
• Presents a level of modularity
• Resource constraints, e.g. large databases
• Certain classes of algorithms lend themselves naturally to parallelism
• Aggregate bandwidth to memory/disk; increase in data throughput
• Clock rate improvement in the past decade – 40%
• Memory access time improvement in the past decade – 10%
• Grand challenge problems (more later)
Challenges / Problems in Parallel Algorithms
 Building efficient algorithms
 Avoiding
 Communication delays
 Idling
 Synchronization overheads
Challenges
[Figure: timeline of processes P0 and P1, showing intervals of computation, communication, synchronization and idle time]
How do we evaluate a parallel program?
 Execution time, Tp
 Speedup, S
 S(p, n) = T(1, n) / T(p, n)
 Usually, S(p, n) < p
 Sometimes S(p, n) > p (superlinear speedup)
 Efficiency, E
 E(p, n) = S(p, n) / p
 Usually, E(p, n) < 1
 Sometimes, greater than 1
 Scalability – limitations in parallel computing, and how they relate to n and p
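
As a small illustration (mine, not from the slides), the following C sketch computes speedup and efficiency from measured timings; the timing values and processor count are placeholders.

#include <stdio.h>

int main(void)
{
    /* Hypothetical measured wall-clock times in seconds (placeholders only). */
    double t_serial   = 120.0;   /* T(1, n) */
    double t_parallel = 17.5;    /* T(p, n) */
    int    p          = 8;       /* number of processors used */

    double speedup    = t_serial / t_parallel;   /* S(p, n) = T(1, n) / T(p, n) */
    double efficiency = speedup / (double)p;     /* E(p, n) = S(p, n) / p       */

    printf("S(%d, n) = %.2f\n", p, speedup);
    printf("E(%d, n) = %.2f\n", p, efficiency);
    return 0;
}

With these placeholder numbers the sketch prints S(8, n) = 6.86 and E(8, n) = 0.86, consistent with the usual case S(p, n) < p and E(p, n) < 1.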
Speedups and efficiency
[Figure: speedup S and efficiency E plotted against p, comparing the ideal curves with practical ones]
Limitations on speedup – Amdahl’s law
 Amdahl's law states that the performance improvement to be gained from using some faster mode of execution is limited by the fraction of the time the faster mode can be used.
 Overall speedup is expressed in terms of the fractions of computation time with and without the enhancement, and the improvement due to the enhancement.
 Places a limit on the speedup due to parallelism.
 Speedup = 1 / (fs + fp/P), where fs is the serial fraction, fp the parallel fraction, and P the number of processors
Amdahl’s law Illustration
S = 1 / (s + (1-s)/p)
[Figure: efficiency falling from 1 towards 0 as p increases from 0 to 15]
Courtesy:
http://www.metz.supelec.fr/~dedu/docs/kohPaper/node2.html
http://nereida.deioc.ull.es/html/openmp/pdp2002/sld008.htm
Amdahl’s law analysis
f      P=1   P=4   P=8   P=16   P=32
1.00   1.0   4.00  8.00  16.00  32.00
0.99   1.0   3.88  7.48  13.91  24.43
0.98   1.0   3.77  7.02  12.31  19.75
0.96   1.0   3.57  6.25  10.00  14.29
• For a fixed parallel fraction f, the speedup falls further and further behind the processor count as P grows.
• Thus Amdahl’s law is a bit depressing for parallel programming.
• In practice, the parallel portion of the work has to be large enough to match a given number of processors.
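
The table can be reproduced directly from the Amdahl formula; a short C sketch (my illustration, not part of the slides):

#include <stdio.h>

/* Amdahl speedup for parallel fraction f on P processors. */
static double amdahl(double f, int P)
{
    return 1.0 / ((1.0 - f) + f / P);
}

int main(void)
{
    double fracs[] = {1.00, 0.99, 0.98, 0.96};
    int    procs[] = {1, 4, 8, 16, 32};

    printf("f      P=1    P=4    P=8    P=16   P=32\n");
    for (int i = 0; i < 4; i++) {
        printf("%.2f", fracs[i]);
        for (int j = 0; j < 5; j++)
            printf("  %5.2f", amdahl(fracs[i], procs[j]));
        printf("\n");
    }
    return 0;
}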
Gustafson’s Law
 Amdahl’s law – keep the problem size, and hence the parallel work, fixed
 Gustafson’s law – keep the computation time on the parallel processors fixed, and let the fraction of parallel work grow to match that computation time
 The serial component of the code is independent of problem size
 The parallel component scales with problem size, which in turn scales with the number of processors
 Scaled Speedup, S = (Seq + Par(P)*P) / (Seq + Par(P))
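
A worked instance (the numbers are illustrative, not from the slides): normalizing the scaled run so that $Seq + Par(P) = 1$ gives

\[ S = \frac{Seq + Par(P)\,P}{Seq + Par(P)} = Seq + (1 - Seq)\,P, \]

so with a serial fraction $Seq = 0.05$ and $P = 32$, $S = 0.05 + 0.95 \times 32 \approx 30.5$, much closer to linear than the fixed-size Amdahl prediction of about 12.5 for the same serial fraction.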
Metrics (Contd..)

Table 5.1: Efficiency as a function of n and p.

N     P=1   P=4   P=8   P=16   P=32
64    1.0   0.80  0.57  0.33
192   1.0   0.92  0.80  0.60
512   1.0   0.97  0.91  0.80
Scalability
 Efficiency decreases with increasing P; increases with increasing N
 Scalability: how effectively the parallel algorithm can use an increasing number of processors
 How the amount of computation performed must scale with P to keep E constant
 This function of N in terms of P is called the isoefficiency function
 An algorithm with an isoefficiency function of O(P) is highly scalable, while an algorithm with a quadratic or exponential isoefficiency function is poorly scalable
Scalability Analysis – Finite Difference algorithm with 1-D decomposition
For constant efficiency, a function of P, when substituted for N, must satisfy the following relation for increasing P and constant E.

Can be satisfied with N = P, except for small P.

Hence the isoefficiency function is O(P²), since the computation is O(N²).
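
A sketch of how this is usually derived (my reconstruction, under the common assumptions of a per-iteration compute time of order N²t_c/P and a per-process communication cost of order t_s + N·t_w for the 1-D decomposition):

\[ E = \frac{t_c N^2}{t_c N^2 + 2 t_s P + 2 t_w N P} \]

Keeping E constant as P grows then forces N to grow roughly in proportion to P, and since the total computation grows as N², the work required grows as O(P²).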
Scalability Analysis – Finite Difference algorithm with 2-D decomposition

Can be satisfied with N = sqrt(P)

Hence the isoefficiency function is O(P)

The 2-D algorithm is more scalable than the 1-D one
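
In equation form, the conclusion on this slide is simply that constant efficiency now only requires

\[ N \propto \sqrt{P} \quad\Rightarrow\quad W = O(N^2) = O(P), \]

which is the O(P) isoefficiency function quoted above.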


Parallel Algorithm – Design, Types and Models
Parallel Algorithm Design – Components
 Decomposition – splitting the problem into tasks or modules
 Mapping – assigning tasks to processors
 Mapping’s contradictory objectives
 To minimize idle times
 To reduce communications
Parallel Algorithm Design – Containing Interaction Overheads
 Maximizing data locality
 Minimizing volume of data exchange
 Minimizing frequency of interactions
 Minimizing contention and hot spots
 Overlapping computations with interactions (see the sketch after this list)
 Overlapping interactions with other interactions
 Replicating data or computations
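
As one concrete, hypothetical illustration of overlapping computation with interaction, a nonblocking MPI halo exchange lets each process update its interior points while the boundary data is still in flight; the grid size, array names and ring of neighbours below are all placeholders of my own.

#include <mpi.h>

#define N 1024

static double u[N], u_new[N];          /* placeholder 1-D grid owned by this rank */
static double halo_left_in, halo_right_out;

/* Overlap a halo exchange with the interior update. */
static void exchange_and_compute(int left, int right)
{
    MPI_Request reqs[2];

    /* Start the halo exchange; do not wait for it yet. */
    MPI_Irecv(&halo_left_in, 1, MPI_DOUBLE, left, 0, MPI_COMM_WORLD, &reqs[0]);
    halo_right_out = u[N - 1];
    MPI_Isend(&halo_right_out, 1, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[1]);

    /* Overlap: interior points do not depend on the incoming halo value. */
    for (int i = 1; i < N - 1; i++)
        u_new[i] = 0.5 * (u[i - 1] + u[i + 1]);

    /* Only now wait for the messages, then finish the point that needs the halo. */
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    u_new[0] = 0.5 * (halo_left_in + u[1]);
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    int left  = (rank - 1 + size) % size;   /* neighbours on a ring of ranks */
    int right = (rank + 1) % size;
    exchange_and_compute(left, right);
    MPI_Finalize();
    return 0;
}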
Parallel Algorithm Types and Models
 Single Program Multiple Data (SPMD)
 Multiple Program Multiple Data (MPMD)

Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/
Parallel Algorithm Types and Models
 Master-Worker / parameter sweep / task farming
 Pipeline / systolic / wavefront
[Figure: a master P0 distributing work to workers P1–P4, and a pipeline P0 → P1 → P2 → P3 → P4]

Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/
Parallel Algorithm Types and Models
 Data parallel model
 Processes perform identical tasks on different data (see the sketch after this list)
 Task parallel model
 Different processes perform different tasks on the same or different data, based on a task dependency graph
 Work pool model
 Any task can be performed by any process; tasks are added to a work pool dynamically
 Pipeline model
 A stream of data passes through a chain of processes – stream parallelism
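
A minimal data parallel sketch (my illustration, assuming OpenMP is available): every thread runs the same operation over its own slice of the array; the array size and the operation are arbitrary.

#include <stdio.h>
#include <omp.h>

#define N 1000000

static double a[N], b[N];

int main(void)
{
    for (int i = 0; i < N; i++)
        a[i] = (double)i;

    /* Data parallel: every thread applies the same operation to its own
       block of iterations. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        b[i] = 2.0 * a[i] + 1.0;

    printf("b[N-1] = %.1f (up to %d threads)\n", b[N - 1], omp_get_max_threads());
    return 0;
}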
Parallel Architectures
- Classification
- Cache coherence in shared memory platforms
- Interconnection networks
Classification of Architectures – Flynn’s classification
 Single Instruction Single Data (SISD): serial computers
 Single Instruction Multiple Data (SIMD)
 - Vector processors and processor arrays
 - Examples: CM-2, Cray-90, Cray YMP, Hitachi 3600

Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/
Classification of Architectures – Flynn’s classification
 Multiple Instruction Single Data (MISD): not popular
 Multiple Instruction Multiple Data (MIMD)
 - The most popular
 - IBM SP and most other supercomputers, clusters, computational Grids, etc.

Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/
Classification of Architectures – Based on Memory
 Shared memory
 2 types – UMA and NUMA
 NUMA examples: HP-Exemplar, SGI Origin, Sequent NUMA-Q
[Figure: UMA and NUMA shared-memory organizations]

Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/
Classification of Architectures – Based on Memory
 Distributed memory

Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/

 Recently, multi-cores
 Yet another classification – MPPs, NOW (Berkeley), COW, Computational Grids
Programming Paradigms, Algorithm Types, Techniques
 Shared memory model – Threads, OpenMP (see the sketch below)
 Message passing model – MPI
 Data parallel model – HPF

Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/
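
For the shared memory model, a minimal POSIX threads sketch (my own illustration; the thread count, array size and chunking are arbitrary): all threads read and write the same address space directly, with no explicit messages.

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define N 1000

static double data[N];      /* shared address space: visible to every thread */

static void *worker(void *arg)
{
    long id = (long)arg;
    /* Each thread fills its own contiguous chunk of the shared array. */
    for (long i = id * (N / NTHREADS); i < (id + 1) * (N / NTHREADS); i++)
        data[i] = 2.0 * (double)i;
    return NULL;
}

int main(void)
{
    pthread_t threads[NTHREADS];
    for (long t = 0; t < NTHREADS; t++)
        pthread_create(&threads[t], NULL, worker, (void *)t);
    for (long t = 0; t < NTHREADS; t++)
        pthread_join(threads[t], NULL);
    printf("data[N-1] = %.1f\n", data[N - 1]);
    return 0;
}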
Cache Coherence in SMPs
• All processes read variable ‘x’, which resides in cache line ‘a’
• Each process updates ‘x’ at different points of time
• Challenge: to maintain a consistent view of the data
• Protocols:
  • Write update
  • Write invalidate
[Figure: CPU0–CPU3 each holding a copy of cache line ‘a’ in cache0–cache3, backed by line ‘a’ in main memory]
Cache Coherence Protocols and Implementations
 Write update – propagate the updated cache line to the other processors on every write by a processor
 Write invalidate – each processor gets the updated cache line whenever it reads stale data
 Which is better??
Caches – False Sharing
• Different processors update different parts of the same cache line
• Leads to ping-pong of cache lines between processors
• The situation is better under update protocols than under invalidate protocols. Why?
• Remedy: modify the algorithm to change the stride (see the sketch after this slide)
[Figure: CPU0 updating A0, A2, A4, … and CPU1 updating A1, A3, A5, …, with cache lines A0–A8 and A9–A15 held in cache0, cache1 and main memory]
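
A small demonstration of false sharing (mine, not from the slides), assuming OpenMP and a 64-byte cache line: in the first loop the threads’ counters sit in one cache line and ping-pong between cores; padding each counter to its own line avoids that.

#include <stdio.h>
#include <omp.h>

#define NTHREADS 4
#define ITERS 10000000L
#define LINE 64                              /* assumed cache line size in bytes */

struct padded { long value; char pad[LINE - sizeof(long)]; };

int main(void)
{
    long packed[NTHREADS] = {0};             /* counters packed into one cache line */
    struct padded spaced[NTHREADS] = {{0}};  /* one counter per cache line */

    double t0 = omp_get_wtime();
    #pragma omp parallel num_threads(NTHREADS)
    {
        int id = omp_get_thread_num();
        for (long i = 0; i < ITERS; i++)
            packed[id]++;                    /* false sharing: the line ping-pongs */
    }
    double t1 = omp_get_wtime();

    #pragma omp parallel num_threads(NTHREADS)
    {
        int id = omp_get_thread_num();
        for (long i = 0; i < ITERS; i++)
            spaced[id].value++;              /* each counter on its own line */
    }
    double t2 = omp_get_wtime();

    printf("packed: %.3f s   padded: %.3f s   (sums: %ld %ld)\n",
           t1 - t0, t2 - t1, packed[0], spaced[0].value);
    return 0;
}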
Cache Coherence using Invalidate Protocols
 3 states associated with data items
 Shared – a variable shared by 2 caches
 Invalid – another processor (say P0) has updated the data item
 Dirty – state of the data item in P0
 Implementations
 Snoopy
 For bus-based architectures
 Memory operations are propagated over the bus and snooped
 Directory-based
 Instead of broadcasting memory operations to all processors, propagate coherence operations only to the relevant processors
 A central directory maintains the states of cache blocks and the associated processors
 Implemented with presence bits
Interconnection Networks
 An interconnection network is defined by switches, links and interfaces
 Switches – provide the mapping between input and output ports, buffering, routing, etc.
 Interfaces – connect nodes with the network
 Network topologies
 Static – point-to-point communication links among processing nodes
 Dynamic – communication links are formed dynamically by switches
Interconnection Networks
 Static
 Bus – SGI Challenge
 Completely connected
 Star
 Linear array, ring (1-D torus)
 Mesh – Intel ASCI Red (2-D), Cray T3E (3-D), 2-D torus
 k-d mesh: d dimensions with k nodes in each dimension
 Hypercubes – a (log p)-dimensional mesh with 2 nodes per dimension – e.g. many MIMD machines
 Trees – our campus network
 Dynamic – communication links are formed dynamically by switches
 Crossbar – Cray X series – non-blocking network
 Multistage – SP2 – blocking network
Evaluating Interconnection Topologies
 Diameter – maximum distance between any two processing nodes
 Fully connected – 1
 Star – 2
 Ring – p/2
 Hypercube – log P
 Connectivity – multiplicity of paths between 2 nodes; the minimum number of arcs that must be removed from the network to break it into two disconnected networks
 Linear array – 1
 Ring – 2
 2-D mesh – 2
 2-D mesh with wraparound – 4
 d-dimensional hypercube – d
Evaluating Interconnection Topologies
 Bisection width – minimum number of links to be removed from the network to partition it into 2 equal halves
 Ring – 2
 P-node 2-D mesh – sqrt(P)
 Tree – 1
 Star – 1
 Completely connected – P²/4
 Hypercube – P/2
Evaluating Interconnection Topologies
 Channel width – number of bits that can be communicated simultaneously over a link, i.e. the number of physical wires between 2 nodes
 Channel rate – performance of a single physical wire
 Channel bandwidth – channel rate times channel width
 Bisection bandwidth – maximum volume of communication between the two halves of the network, i.e. bisection width times channel bandwidth
 Cost
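
A tiny numeric illustration of these definitions (my example; the node count, channel width and channel rate are made-up values, and the P/2 bisection width is the hypercube figure from the earlier slide):

#include <stdio.h>

int main(void)
{
    int    P = 64;                 /* number of nodes (assumed)               */
    int    channel_width = 32;     /* bits per link (assumed)                 */
    double channel_rate  = 1e9;    /* rate of one physical wire, bits/s (assumed) */

    double channel_bw      = channel_rate * channel_width;  /* bits/s per link   */
    int    bisection_width = P / 2;                          /* hypercube: P/2    */
    double bisection_bw    = bisection_width * channel_bw;   /* bits/s across cut */

    printf("channel bandwidth   = %.2e bits/s\n", channel_bw);
    printf("bisection width     = %d links\n", bisection_width);
    printf("bisection bandwidth = %.2e bits/s\n", bisection_bw);
    return 0;
}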
