Slides
Module 5
Analytical Modeling of Parallel Systems
Contents
1. Effect of Granularity on Performance
2. Scalability of Parallel Systems
3. Minimum Execution Time and Minimum Cost-Optimal Execution Time
4. Asymptotic Analysis of Parallel Programs
5. Other Scalability Metrics
Weekly Learning Outcomes
Required Reading
Chapter 5, Analytical Modeling of Parallel Systems: Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar, "Introduction to Parallel Computing", Addison Wesley, 2003.
Recommended Reading
Granularity in Parallel Computing
https://www.youtube.com/watch?v=AlzOErpaXE8
Effect of Granularity on Performance
• Often, using fewer processors improves the performance of parallel systems.
• Using fewer than the maximum possible number of processing elements to execute a parallel algorithm is called scaling down a parallel system.
• A naive way of scaling down is to think of each processor in the original case as a virtual processor and to assign virtual processors equally to scaled-down processors.
• Since the number of processing elements decreases by a factor of n / p, the computation at each processing element increases by a factor of n / p.
• The communication cost should not increase by this factor, since some of the virtual processors assigned to a physical processor might talk to each other. This is the basic reason for the improvement from building granularity.
Building Granularity: Example
• Consider the problem of adding n numbers on p processing elements such that p
< n and both n and p are powers of 2.
• Use the parallel algorithm for n processors, except, in this case, we think of them
as virtual processors.
• The first log p of the log n steps of the original algorithm are simulated in (n / p) log p steps on p processing elements; the overall parallel execution time of this formulation is therefore Θ((n / p) log p).
• The cost is Θ(n log p), which is asymptotically higher than the Θ(n) cost of adding n numbers sequentially. Therefore, the parallel system is not cost-optimal.
Building Granularity: Example (continued)
Can we build granularity in the example in a cost-optimal fashion?
• Each processing element locally adds its n / p numbers in time Θ(n / p).
• The p partial sums on p processing elements can be added in time Θ(log p), giving a parallel runtime of

TP = Θ(n / p + log p)    (3)

• The cost is p · Θ(n / p + log p) = Θ(n + p log p), which is cost-optimal as long as n = Ω(p log p).
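A rough numerical sketch of the two scaled-down formulations, assuming a unit-cost model (one time unit per addition or per simulated communication step) that is illustrative rather than taken from the text:

import math

def naive_time(n, p):
    # Simulate the n-processor algorithm on p processors: the first log p
    # steps each take about n/p time, giving roughly (n/p) * log p overall.
    return (n / p) * math.log2(p)

def cost_optimal_time(n, p):
    # Add n/p numbers locally, then combine p partial sums in log p steps.
    return n / p + math.log2(p)

n, p = 2 ** 16, 2 ** 6
for name, t in (("naive", naive_time(n, p)), ("cost-optimal", cost_optimal_time(n, p))):
    print(f"{name:>12}: T_P ~ {t:8.1f}   cost p*T_P ~ {p * t:9.1f}   (serial work ~ {n})")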
Figure: A comparison of the speedups obtained by the binary-exchange, 2-D transpose and 3-D transpose algorithms on 64 processing elements with tc = 2, tw = 4, ts = 25, and th = 2.
Clearly, it is difficult to infer scaling characteristics from observations on small
datasets on small machines.
Scaling Characteristics of Parallel Programs
• The efficiency of a parallel program can be written as:

E = S / p = TS / (p TP)    (4)

where:
S is the speedup,
p is the number of processing elements,
TS is the execution time of the sequential (serial) program,
TP is the execution time of the parallel program,

or, in terms of the total overhead To = p TP - TS:

E = 1 / (1 + To / TS)    (7)

• For a fixed problem size, efficiency decreases as the number of processing elements increases, because the total overhead To grows with p and reduces the benefit of parallelization.
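A small sketch of Equation (4) for the running example of adding n numbers, assuming the cost model TP = n/p + 2 log p that is used later in these slides:

import math

def efficiency(n, p):
    ts = n                          # serial time: about n additions
    tp = n / p + 2 * math.log2(p)   # parallel time model for adding n numbers
    s = ts / tp                     # speedup S = TS / TP
    return s / p                    # efficiency E = S / p, Equation (4)

n = 4096
for p in (1, 4, 16, 64, 256):
    print(f"p = {p:3d}   E = {efficiency(n, p):.3f}")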
Scaling Characteristics of Parallel Programs: Example
(continued)
Plotting the speedup for various input sizes gives us:
Figure: Speedup versus the number of processing elements for adding a list of numbers.
Speedup tends to saturate and efficiency drops as a consequence of Amdahl's law.
Scaling Characteristics of Parallel Programs
• The total overhead function To is a function of both the problem size TS and the number of processing elements p.
• In many cases, To grows sublinearly with respect to TS.
• In such cases, the efficiency increases if the problem size is increased while keeping the number of processing elements constant, as illustrated below.
• For such systems, we can simultaneously increase the problem size and the number of processors to keep the efficiency constant.
• We call such systems scalable parallel systems.
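For instance, with the overhead To = 2p log p of the adding-n-numbers example (used again later in this module), efficiency rises with problem size at a fixed processor count; a minimal sketch:

import math

def efficiency(n, p):
    # E = 1 / (1 + To/TS) with To = 2 p log p for adding n numbers
    return 1.0 / (1.0 + 2 * p * math.log2(p) / n)

p = 64
for n in (256, 1024, 4096, 16384, 65536):
    print(f"n = {n:6d}   E = {efficiency(n, p):.3f}")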
Scaling Characteristics of Parallel Programs
• In terms of the problem size W (the serial runtime TS) and the overhead function To(W, p), the parallel runtime can be written as:

TP = (W + To(W, p)) / p    (8)

• The resulting expression for speedup is:

S = W / TP = W p / (W + To(W, p))    (9)

• The efficiency is then:

E = S / p = 1 / (1 + To(W, p) / W)    (11)

• Efficiency can be kept at a fixed value if the ratio To / W is held constant; for a desired efficiency E this gives:

W = K To(W, p), where K = E / (1 - E) is a constant    (12)
Isoefficiency Metric of Scalability
• The problem size W can usually be obtained as a function of p by algebraic manipulations to keep efficiency constant.
• This function determines the ease with which a parallel system can maintain a constant efficiency and hence achieve speedups increasing in proportion to the number of processing elements.
Isoefficiency Metric: Example
• The overhead function for the problem of adding n numbers on p processing elements is approximately 2p log p.
• Substituting To by 2p log p in Equation 12, we get:

W = 2K p log p    (13)

• Thus, the asymptotic isoefficiency function for this parallel system is Θ(p log p).
• If the number of processing elements is increased from p to p', the problem size (in this case, n) must be increased by a factor of (p' log p') / (p log p) to get the same efficiency as on p processing elements, as the sketch below illustrates.
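A minimal check of this scaling rule, confirming that efficiency stays constant when n grows by the stated factor (the starting point p = 8, n = 512 is arbitrary):

import math

def efficiency(n, p):
    # E = 1 / (1 + 2 p log p / n) for adding n numbers
    return 1.0 / (1.0 + 2 * p * math.log2(p) / n)

p, n = 8, 512
for p_new in (16, 32, 64):
    factor = (p_new * math.log2(p_new)) / (p * math.log2(p))
    n_new = n * factor
    print(f"p: {p} -> {p_new:2d}   n: {n} -> {n_new:6.0f}   "
          f"E: {efficiency(n, p):.3f} -> {efficiency(n_new, p_new):.3f}")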
Isoefficiency Metric: Example
• Consider a more complex example where To = p^(3/2) + p^(3/4) W^(3/4).
• Using only the first term of To in Equation 12, we get:

W = K p^(3/2)    (14)

• Using only the second term of To yields the following relation between W and p:

W = K p^(3/4) W^(3/4)    (15)
W^(1/4) = K p^(3/4)    (16)

• From this, we have:

W = K^4 p^3    (17)

• The larger of these two asymptotic rates determines the isoefficiency function of the system:

Isoefficiency = Θ(p^3)    (18)
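A quick numerical check of which term dominates, taking K = 1 purely for illustration:

for p in (16, 64, 256, 1024):
    # With To = p^(3/2) + p^(3/4) W^(3/4) and K = 1, compare the problem
    # size demanded by each term of the relation W = K * To(W, p).
    w_first = p ** 1.5     # from W = K p^(3/2), Equation (14)
    w_second = p ** 3      # from W = K^4 p^3, Equation (17)
    print(f"p = {p:5d}   first term: W ~ {w_first:12.0f}   second term: W ~ {w_second:15.0f}")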
• Consider the problem of solving a system of n linear equations in n variables by Gaussian elimination, for which W = Θ(n^3).
• The n variables must be eliminated one after the other, and eliminating each variable requires Θ(n^2) computations.
• At most Θ(n^2) processing elements can be kept busy at any time.
• Since W = Θ(n^3) for this problem, the degree of concurrency C(W) is Θ(W^(2/3)).
• Given p processing elements, the problem size should be at least Ω(p^(3/2)) to use them all.
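A tiny sketch of this lower bound (asymptotic constants are dropped):

def min_problem_size(p):
    # Degree of concurrency C(W) = W^(2/3) must be at least p,
    # so the problem size must satisfy W >= p^(3/2).
    return p ** 1.5

for p in (4, 16, 64, 256):
    print(f"p = {p:3d}   smallest usable W ~ {min_problem_size(p):9.0f}")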
Minimum Execution Time and Minimum Cost-Optimal Execution Time
Often, we are interested in the minimum time to solution.
• We can determine the minimum parallel runtime TP_min for a given W by differentiating the expression for TP with respect to p and setting the derivative to zero:

dTP / dp = 0    (19)

• For the problem of adding n numbers, TP = n/p + 2 log p; solving dTP / dp = 0 gives p = n/2, and the corresponding minimum parallel runtime is:

TP_min = 2 log n    (21)
• If the isoefficiency function of a parallel system is Θ(f(p)), then a problem of size W can be solved cost-optimally if and only if W = Ω(f(p)).
• In other words, cost-optimality requires p = O(f^(-1)(W)); since TP = Θ(W / p) for a cost-optimal system, the minimum cost-optimal parallel runtime is:

TP_cost_opt = Θ(W / f^(-1)(W))    (22)
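A small sketch that checks Equation (21) numerically and also evaluates an assumed cost-optimal operating point p ~ n / log n, which follows from the Θ(p log p) isoefficiency of this example:

import math

def tp(n, p):
    # TP = n/p + 2 log p; the natural logarithm is used here so that the
    # minimizer matches the closed form p0 = n/2 from Equation (21).
    return n / p + 2 * math.log(p)

n = 4096
p0 = min(range(1, n + 1), key=lambda p: tp(n, p))   # brute-force minimizer
print(f"argmin p = {p0}   (closed form n/2 = {n // 2})")
print(f"TP_min = {tp(n, p0):.2f}   (approximately 2 log n = {2 * math.log(n):.2f})")

# Assumed cost-optimal operating point, p ~ n / log n, from the
# Theta(p log p) isoefficiency of this example.
p_c = round(n / math.log(n))
print(f"cost-optimal p ~ {p_c}: TP = {tp(n, p_c):.2f}, cost p*TP = {p_c * tp(n, p_c):.0f} vs W = {n}")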
Asymptotic Analysis of Parallel Programs
• Consider the problem of sorting a list of n numbers. The fastest serial
programs for this problem run in time Θ(n log n). Consider four parallel
algorithms, A1, A2, A3, and A4 as follows:
• Comparison of four different algorithms for sorting a given list of numbers. The table shows the number of processing elements, the parallel runtime, the speedup, the efficiency and the pTP product:

           A1           A2          A3                A4
p          n^2          log n       n                 sqrt(n)
TP         1            n           sqrt(n)           sqrt(n) log n
S          n log n      log n       sqrt(n) log n     sqrt(n)
E          (log n)/n    1           (log n)/sqrt(n)   1
pTP        n^2          n log n     n^1.5             n log n
Asymptotic Analysis of Parallel Programs
• If the metric is speed, algorithm A1 is the best, followed by A3, A4, and A2 (in order of increasing TP).
• In terms of efficiency, A2 and A4 are the best, followed by A3 and A1.
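A short sketch that evaluates the table for a concrete input size (n = 1024 is an arbitrary choice):

import math

n = 1024
ts = n * math.log2(n)   # best serial sorting time, Theta(n log n)

algorithms = {
    "A1": (n ** 2,            1.0),
    "A2": (math.log2(n),      float(n)),
    "A3": (float(n),          math.sqrt(n)),
    "A4": (math.sqrt(n),      math.sqrt(n) * math.log2(n)),
}

print(f"{'alg':>3} {'p':>9} {'TP':>8} {'S':>9} {'E':>7} {'p*TP':>10}")
for name, (p, tp) in algorithms.items():
    s = ts / tp
    print(f"{name:>3} {p:9.0f} {tp:8.1f} {s:9.1f} {s / p:7.3f} {p * tp:10.0f}")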
Scaled Speedup: Example
• Scaled speedup is the speedup obtained when the problem size is increased linearly with the number of processing elements.
• The serial runtime of multiplying an n x n matrix with a vector is tc n^2. The parallel runtime on p processing elements is TP = tc n^2 / p + ts log p + tw n, so the speedup is:

S = tc n^2 / (tc n^2 / p + ts log p + tw n)    (24)

Where:
tc : computation time per element (one multiply-add),
ts : message startup (setup) time,
tw : per-word transfer (communication) time,
p : number of processing elements,
c : a constant relating the problem size to the number of processing elements.

• Under memory-constrained scaling, the memory requirement Θ(n^2) grows linearly with p, i.e. n^2 = c p, and the resulting speedup grows only as O(sqrt(p)).
• This is not a particularly scalable system.
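A minimal sketch of Equation (24) under memory-constrained scaling (n^2 = c p); the constant values below are illustrative assumptions:

import math

tc, ts, tw, c = 1.0, 25.0, 4.0, 4096.0   # illustrative constants, not from the text

def scaled_speedup(p):
    n_sq = c * p                              # memory-constrained scaling: n^2 = c * p
    n = math.sqrt(n_sq)
    t_par = tc * n_sq / p + ts * math.log2(p) + tw * n
    return tc * n_sq / t_par                  # Equation (24)

for p in (4, 16, 64, 256, 1024):
    s = scaled_speedup(p)
    print(f"p = {p:4d}   S = {s:7.1f}   S/p = {s / p:.2f}")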
Scaled Speedup: Example (continued)
• Consider the case of time-constrained scaling, in which the parallel runtime TP is held constant as p increases.
• For this problem, keeping TP constant again requires n^2 = O(p), so the resulting speedup behaves just as in the memory-constrained case.
• This is not surprising, since the memory and time complexity of the operation are identical, Θ(n^2).
Scaled Speedup: Example
• The serial runtime of multiplying two matrices of dimension n x n is tc n^3.
• The parallel runtime of a given parallel formulation of this problem is:

TP = tc n^3 / p + ts log p + tw n^2 / sqrt(p)    (25)

• The speedup S is given by:

S = tc n^3 / (tc n^3 / p + ts log p + tw n^2 / sqrt(p))    (26)
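A small numerical sketch of Equation (26) under the two scaling disciplines discussed above, memory-constrained (n^2 proportional to p) and time-constrained (n^3 proportional to p); all constants are illustrative assumptions:

import math

tc, ts, tw = 1.0, 25.0, 4.0   # illustrative constants, not from the text

def speedup(n, p):
    # Equation (26) for multiplying two n x n matrices on p processing elements
    t_par = tc * n ** 3 / p + ts * math.log2(p) + tw * n ** 2 / math.sqrt(p)
    return tc * n ** 3 / t_par

for p in (16, 64, 256, 1024):
    n_mem = math.sqrt(64 * p)        # memory-constrained scaling: n^2 = c p (c = 64 assumed)
    n_time = (64 * p) ** (1 / 3)     # time-constrained scaling: n^3 = c p (c = 64 assumed)
    print(f"p = {p:5d}   memory-constrained S = {speedup(n_mem, p):8.1f}"
          f"   time-constrained S = {speedup(n_time, p):7.1f}")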
Serial Fraction f
• If the serial runtime W = TS is divided into a totally serial component Tser and a perfectly parallelizable component, the serial fraction f of a parallel program is defined as:

f = Tser / W

• Therefore, we have:

TP = f W + (1 - f) W / p

Serial Fraction
• Since S = W / TP, we have 1 / S = f + (1 - f) / p. Solving for f gives:

f = (1/S - 1/p) / (1 - 1/p)    (27)
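A small sketch of Equation (27), estimating f from observed speedups; the measurements below are made up for illustration:

def serial_fraction(speedup, p):
    # Equation (27): f = (1/S - 1/p) / (1 - 1/p)
    return (1.0 / speedup - 1.0 / p) / (1.0 - 1.0 / p)

# Illustrative (made-up) measurements: (number of processors, observed speedup)
measurements = [(2, 1.9), (4, 3.5), (8, 6.0), (16, 9.0)]
for p, s in measurements:
    print(f"p = {p:2d}   S = {s:4.1f}   estimated serial fraction f = {serial_fraction(s, p):.3f}")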