15CS72 ACA Module1 Chapter3FinalCopy
Degree of Parallelism
The plot of the DOP as a function of time is called the parallelism profile of a given program. The profile fluctuates depending on the algorithmic structure, program optimization, resource utilization, and run-time conditions of the computer system. The DOP is defined under the assumption of unlimited resources.
Average Parallelism
The average parallelism A is the total work divided by the total elapsed time over the parallelism profile, A = ( Σ(i=1..m) i·ti ) / ( Σ(i=1..m) ti ), where ti is the total time during which the DOP equals i and m is the maximum DOP.
Available Parallelism
Computation that is less numeric than that in scientific codes has relatively little parallelism, even when basic block boundaries are ignored. A basic block is a sequence or block of instructions in a program with a single entry point and a single exit point. The available parallelism can be increased by compiler optimization and algorithm redesign.
Asymptotic SpeedUp
Let Δ be the computing capacity of a single processor and ti the total time during which the DOP equals i. The amount of work performed while the DOP is i is Wi = i·Δ·ti.
The execution time of Wi on a single processor is ti(1) = Wi/Δ.
The execution time of Wi on k processors is ti(k) = Wi/(kΔ).
With an infinite number of processors, ti(∞) = Wi/(iΔ).
T(1) is the response time of a single-processor system when executing the whole workload, and T(∞) is the response time of executing the same workload if an infinite number of processors is available. The asymptotic speedup S∞ is defined as the ratio of T(1) to T(∞):
S∞ = T(1)/T(∞) = ( Σ(i=1..m) Wi ) / ( Σ(i=1..m) Wi/i )
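A small Python sketch of this computation may help; the parallelism profile values, the capacity Δ (delta), and the variable names are all assumed for illustration.

```python
# A minimal sketch of the asymptotic speedup computation, assuming a
# hypothetical parallelism profile: each entry maps DOP i to the time t_i
# (in unit time steps) spent executing at that degree of parallelism.
# delta is the computing capacity of a single processor (work per unit time).

delta = 1.0                          # assumed single-processor capacity
profile = {1: 4, 2: 3, 4: 2, 8: 1}   # hypothetical DOP -> t_i pairs

# Work done at DOP i: W_i = i * delta * t_i
work = {i: i * delta * t for i, t in profile.items()}

# T(1): all work executed on one processor
T1 = sum(w / delta for w in work.values())

# T(inf): unlimited processors, each phase runs at DOP i
Tinf = sum(w / (i * delta) for i, w in work.items())

S_inf = T1 / Tinf   # asymptotic speedup (equals the average parallelism)
print(f"T(1) = {T1}, T(inf) = {Tinf}, S_inf = {S_inf:.2f}")
```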
Arithmetic Mean Performance: Let Ri be the execution rate of the i-th of m benchmark programs. The arithmetic mean execution rate is defined as
Ra = (1/m) · Σ(i=1..m) Ri
The expression Ra assumes equal weighting (1/m) for all programs. If the programs are weighted with the distribution π = {fi | i = 1, 2, ..., m}, we define the weighted arithmetic mean execution rate as
R*a = Σ(i=1..m) fi·Ri
However, the arithmetic mean execution rate fails to represent the real time consumed by the benchmarks when they are actually executed, and in the presence of anomalous results it is distorted far more than the geometric mean.
Harmonic Mean Performance: This measures performance in terms of the arithmetic mean execution time Ta across a large number of programs. The harmonic mean execution rate across m benchmark programs is thus defined as
Rh = 1/Ta = m / ( Σ(i=1..m) 1/Ri )
If the programs are weighted with the distribution π = {fi | i = 1, 2, ..., m}, we define the weighted harmonic mean execution rate as
R*h = 1 / ( Σ(i=1..m) fi/Ri )
Compared to the arithmetic mean, the harmonic mean execution rate is closer to the real performance.
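The following Python sketch contrasts these mean execution rates on hypothetical rates Ri and weights fi; all numbers are illustrative only.

```python
# A minimal sketch comparing the arithmetic and harmonic mean execution
# rates for hypothetical benchmark results. R_i is the execution rate of
# program i; f_i is an assumed weight distribution summing to 1.

rates   = [10.0, 20.0, 40.0]   # hypothetical execution rates R_i
weights = [0.5, 0.3, 0.2]      # hypothetical weights f_i

m = len(rates)

Ra      = sum(rates) / m                                    # arithmetic mean rate
Ra_star = sum(f * r for f, r in zip(weights, rates))        # weighted arithmetic mean
Rh      = m / sum(1.0 / r for r in rates)                   # harmonic mean rate
Rh_star = 1.0 / sum(f / r for f, r in zip(weights, rates))  # weighted harmonic mean

print(Ra, Ra_star, Rh, Rh_star)   # the harmonic means are lower, tracking real time
```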
Amdahl’s Law
This law can be derived from the previous formulation under the assumption that the system works in only two modes: fully sequential mode with probability α, or fully parallel mode with probability 1-α. Amdahl's speedup Sn equals the ratio of the number of processors n to 1+(n-1)α:
Sn = n / (1 + (n-1)α)
This implies that, under the above assumption, the best speedup we can obtain is upper-bounded by 1/α regardless of how many processors the system actually has, because Sn → 1/α as n → ∞. In the figure, Sn is plotted as a function of n for different values of α. Note that the ideal speedup is achieved when α = 0, and the speedup drops sharply as α increases.
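A short Python sketch of Amdahl's formula, tabulated for a few assumed values of α and n, illustrates this saturation.

```python
# A minimal sketch of Amdahl's speedup S_n = n / (1 + (n - 1) * alpha),
# tabulated for a few hypothetical sequential fractions alpha.

def amdahl_speedup(n, alpha):
    """Speedup with n processors and sequential fraction alpha."""
    return n / (1 + (n - 1) * alpha)

for alpha in (0.0, 0.01, 0.1, 0.5):
    row = [round(amdahl_speedup(n, alpha), 1) for n in (1, 16, 256, 4096)]
    print(f"alpha = {alpha}: {row}")   # speedup saturates near 1/alpha for alpha > 0
```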
Let O(n) be the total number of unit operations performed by an n-processor system and T(n) the execution time in unit time steps. Assume T(1) = O(1) for a uniprocessor system. The speedup factor is given as
S(n) = T(1)/T(n)
The system efficiency for an n-processor system is defined as
E(n) = S(n)/n = T(1) / (n·T(n))
The lowest efficiency corresponds to the case of the entire program code being executed sequentially on a single processor while the other processors remain idle. The maximum efficiency is achieved when all n processors are fully utilized throughout the execution period.
Also, 1 ≤ S(n) ≤ n and 1/n ≤ E(n) ≤ 1.
Redundancy and Utilization
The redundancy in a parallel computation is defined as the ratio of O(n) to O(1). It signifies the extent to which software parallelism matches the hardware parallelism:
R(n) = O(n)/O(1)
The system utilization is defined as
U(n) = R(n)·E(n) = O(n) / (n·T(n))
The system utilization indicates the percentage of resources (processors, memories,
etc.) that was kept busy during the execution of a parallel program.
Quality of Parallelism
The quality of parallelism is directly proportional to the speedup and efficiency and
inversely related to the redundancy. Thus, we have
Q(n) = S(n)·E(n) / R(n)
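The metrics above can be tied together in a short Python sketch; the operation counts and execution times used here are hypothetical.

```python
# A minimal sketch of the performance metrics above, using hypothetical
# measurements: O(1), O(n) are unit-operation counts and T(1), T(n) are
# execution times in unit time steps on 1 and n processors respectively.

def metrics(n, O1, T1, On, Tn):
    S = T1 / Tn        # speedup S(n)
    E = S / n          # efficiency E(n) = T(1) / (n * T(n))
    R = On / O1        # redundancy R(n)
    U = R * E          # utilization U(n) = O(n) / (n * T(n))
    Q = S * E / R      # quality of parallelism Q(n)
    return S, E, R, U, Q

# Hypothetical example: 4 processors, 10% extra operations, roughly 3x faster.
print(metrics(n=4, O1=1000, T1=1000, On=1100, Tn=333))
```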
Whetstone
This is a Fortran-based synthetic benchmark assessing floating-point performance, measured in the number of KWhetstones/s that a system can perform. The benchmark includes both integer and floating-point operations involving array indexing, subroutine calls, parameter passing, conditional branching, and trigonometric and transcendental functions. Both the Dhrystone and Whetstone are synthetic benchmarks whose performance results depend heavily on the compilers used. Both benchmarks have been criticized for being unable to predict the performance of user programs; this sensitivity to compilers is a major drawback.
On the other hand, data parallelism is much higher than instruction-level parallelism. Data parallelism is parallelization across multiple processors; it focuses on distributing the data across different nodes, which operate on the data in parallel. Data parallelism refers to a situation where the same operation (instruction or program) executes over a large array of data. It has been implemented on pipelined vector processors, SIMD array processors, and SPMD or MPMD multicomputer systems.
Efficiency Curves
If the workload is kept constant, the efficiency decreases rapidly as shown in the figure. Hence it is necessary to increase the machine size and the problem size proportionally. Such a system is known as a scalable computer for solving scalable problems. The corresponding efficiency curve, denoted α, corresponds to Amdahl's law.
If the workload grows linearly with machine size, this is the ideal case and the efficiency curve, denoted γ, is almost flat.
If linear workload growth is not achievable, the second choice is sublinear scalability as close to linearity as possible, illustrated by curve β in the figure. The sublinear efficiency curve β lies somewhere between curves α and γ.
If the workload has an exponential growth pattern, the system is poorly scalable. The reason is that, to keep a constant efficiency or a good speedup, the increase in workload with problem size becomes explosive and exceeds the memory or I/O limits. Thus the efficiency curve θ is achievable only with exponentially increased memory capacity.
Application Models: There are three speedup performance models defined below, bounded respectively by limited memory, limited tolerance for the latency of interprocessor communication (IPC), and limited I/O bandwidth.
Scalability Analysis
Scalability analysis determines whether parallel processing of a given problem can offer
the desired improvement in performance. The analysis should help guide the design of
a massively parallel processor.
Some tradeoffs in scalability analysis:
1. Computer cost and programming overhead should be considered in scalability analysis.
2. A multi-user environment, in which multiple programs execute concurrently by sharing the available resources, should also be considered.
Parallel algorithms are specially designed for parallel computers. The characteristics of
parallel algorithms are listed below.
1. Deterministic versus non-deterministic: With a deterministic algorithm, for a given input the computer always produces the same output; with a non-deterministic algorithm, the same input may produce different outputs in different runs. Hence deterministic algorithms, which have polynomial time complexity, are the ones implementable on real machines.
2. Computational granularity: Granularity decides the size of data items and program modules used in the computation. In this sense, we also classify algorithms as fine-grain, medium-grain, or coarse-grain. In fine-grained parallelism, a program is broken down into a large number of small tasks; in coarse-grained parallelism, a program is split into a few large tasks. Medium-grained parallelism is a compromise between the two, where the task size and communication time are greater than in fine-grained parallelism and lower than in coarse-grained parallelism. Most general-purpose parallel computers fall in this category.
3. Parallelism Profile: The effectiveness of parallel algorithms is determined by the
distribution of degree of parallelism.
4. Communication patterns and synchronization requirements: Communication patterns address both memory access and interprocessor communications. The patterns can be static or dynamic. Static algorithms are more suitable for SIMD or pipelined machines, while dynamic algorithms are suited to MIMD machines. The synchronization frequency also affects the efficiency of an algorithm.
5. Uniformity of operations: This indicates that the same or uniform operations are performed on the data set, with the data partitioned across the parallel nodes for data-level parallelism. Such algorithms are suitable for SIMD processing or pipelining.
6. Memory requirement and data structures: Memory efficiency is affected by the data structures chosen and the data movement patterns in the algorithm. The efficiency of a parallel algorithm can be evaluated by analyzing its space and time complexity.
Isoefficiency
Let w be the workload and s the problem size, so that w = w(s). Let n be the machine size and h the communication overhead of the parallel algorithm, which is a function of both the machine size n and the problem size s:
h = h(s,n)
The efficiency of a parallel algorithm implemented on a parallel computer is
E = w(s) / ( w(s) + h(s,n) )
A small isoefficiency function means that small increments in the problem size are
sufficient for the efficient utilization of an increasing number of processing elements,
indicating that the parallel system is highly scalable. However, a large isoefficiency
function indicates a poorly scalable parallel system.
In general, the overhead h increases with increasing values of n and s. According to the concept of isoefficiency, the efficiency should remain constant when scaling a parallel algorithm on a parallel computer.
In order to maintain a constant efficiency E, the workload w(s) should grow in proportion to the overhead h(s,n):
w(s) = ( E / (1-E) ) · h(s,n)
The factor C = E / (1-E) is constant for a fixed efficiency E. Thus we define the isoefficiency function, which is used for scalability analysis, as
fE(n) = C·h(s,n)
The isoefficiency functions of common parallel algorithms are polynomial functions of n, i.e. they are O(n^k) for some k ≥ 1. The smaller the power of n, the more scalable the parallel system.
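A minimal Python sketch of this relation, assuming a hypothetical overhead function h(s,n) = s·log2(n), shows how fast the workload must grow to hold E constant.

```python
# A minimal sketch of the isoefficiency relation w(s) = (E / (1 - E)) * h(s, n),
# assuming a hypothetical overhead model h(s, n) = s * log2(n); the real form
# depends on the algorithm and the machine.

import math

def required_workload(E, h):
    """Workload needed to sustain efficiency E given overhead h."""
    return (E / (1.0 - E)) * h

E = 0.8        # target efficiency (assumed)
s = 1024       # problem size (assumed)
for n in (4, 16, 64, 256):
    h = s * math.log2(n)                 # hypothetical overhead
    print(n, required_workload(E, h))    # workload must grow with n to keep E fixed
```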
Consider a situation where the system can operate in only two modes, i.e. sequential mode with DOP = 1 or perfectly parallel mode with DOP = n. With the workload fixed, the speedup is then Sn = n / (1 + (n-1)α), as derived above.
The execution time decreases when the parallel portion of the program is executed by n processors, but the sequential portion does not change and requires a constant amount of time, as shown in the figure below. Thus
W1+Wn= α + (1 − α) = 1
Here α represents the fraction of the program that must be executed sequentially and 1-α corresponds to the portion of the code that can be executed in parallel. The total workload W1+Wn is kept constant.
The speedup curve decreases very rapidly as α increases. This means that even with a small percentage of sequential code, the overall performance cannot exceed 1/α. Hence α is the sequential bottleneck in the program, which cannot be removed by increasing the number of processors.
The fixed-time speedup is defined under the assumption that T(1) = T'(n); that is, the workload is scaled so that the execution time on n processors equals the original execution time on one processor. This leads to the fixed-time speedup formula below.
Gustafson’s law
As shown in the figure below, the workload increases with the number of processors while the execution time remains constant. For the two-mode case, scaling the parallel workload to n·Wn gives the fixed-time (scaled) speedup
S'n = ( W1 + n·Wn ) / ( W1 + Wn ) = α + (1-α)·n
Note that the slope of the S'n curve in the figure is much flatter than under the fixed-workload model. This implies that Gustafson's law does support scalable performance as the machine size increases. The idea is to keep all processors busy by increasing the problem size. When the problem can scale to match the available computing power, the sequential fraction is no longer a bottleneck.
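The contrast with the fixed-workload model can be seen in a short Python sketch; the value of α is assumed purely for illustration.

```python
# A minimal sketch comparing Amdahl's fixed-workload speedup with Gustafson's
# fixed-time (scaled-workload) speedup S'_n = alpha + (1 - alpha) * n for the
# two-mode case discussed above.

def amdahl(n, alpha):
    return n / (1 + (n - 1) * alpha)

def gustafson(n, alpha):
    return alpha + (1 - alpha) * n

alpha = 0.1                # hypothetical sequential fraction
for n in (4, 64, 1024):
    print(n, round(amdahl(n, alpha), 1), round(gustafson(n, alpha), 1))
# Gustafson's curve grows almost linearly in n, while Amdahl's saturates at 1/alpha.
```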
3.3.3 Memory-Bounded Speedup Model
Some scientific and engineering applications require a lot of memory space. The idea
here is to solve a problem of largest size, which is limited by the memory capacity. This
model may result in an increase in execution time to achieve scalable performance. The
fixed-memory model assumes a scaled workload and allows an increase in execution
time. The increase in workload (problem size) is memory bound.
Fixed Memory Speedup
Let M be the memory requirement of a given problem and W the computational workload, so that W = g(M) and M = g⁻¹(W).
Consider that the system operates in two modes, i.e. sequential and perfectly parallel mode, with workloads W1 and Wn. The memory-bounded speedup is then
S*n = ( W1 + G(n)·Wn ) / ( W1 + G(n)·Wn / n )
where G(n) reflects the increase in workload as the memory capacity is increased n times.
The above speedup equation is applicable in the following three cases:
Case 1: G(n) = 1. This corresponds to the case where the problem size is fixed.
Case 2: G(n) = n. This applies to the case where the workload increases n times when the memory is increased n times.
Case 3: G(n) > n. This corresponds to the situation where the computational workload increases faster than the memory requirement.
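A short Python sketch of the two-mode memory-bounded speedup in the form given above may help; the values of n, α, and the choices of G(n) are assumed only to illustrate the three cases.

```python
# A minimal sketch of the two-mode memory-bounded speedup
# S*_n = (alpha + (1 - alpha) * G(n)) / (alpha + (1 - alpha) * G(n) / n),
# i.e. the formula above with W1 = alpha and Wn = 1 - alpha.

def memory_bounded_speedup(n, alpha, G):
    scaled = alpha + (1 - alpha) * G
    return scaled / (alpha + (1 - alpha) * G / n)

n, alpha = 64, 0.1                                  # hypothetical values
print(memory_bounded_speedup(n, alpha, G=1))        # Case 1: reduces to Amdahl's law
print(memory_bounded_speedup(n, alpha, G=n))        # Case 2: reduces to Gustafson's law
print(memory_bounded_speedup(n, alpha, G=n * n))    # Case 3: G(n) > n, higher speedup
```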
T(s,1) is the sequential execution time on a uniprocessor system and T(s,n) is the execution time on a parallel machine with n processors. The asymptotic speedup S(s,n) is defined as
S(s,n) = T(s,1) / ( T(s,n) + h(s,n) )
where h(s,n) is the communication overhead plus the I/O overhead.
T(s,n) can be minimized by increasing the number of processors.
The system efficiency is given as follows
E(s,n)= S(s,n)/n
The best possible efficiency is one, implying that the best speedup is linear, i.e. S(s,n) = n.
A system is scalable if the system efficiency E(s,n) = 1 for all algorithms with any number of processors n and any problem size s.
In dataflow computers, an instruction executes as soon as its data (operands) are available; thus an instruction becomes ready for execution when its operands arrive. Data tokens are passed directly between instructions. The data generated by an instruction is duplicated into multiple copies and forwarded to all instructions that need it. A data token, once consumed by an instruction, is no longer available for reuse. A dataflow computer does not require a program counter or control sequencer, but mechanisms must exist to detect data or operand availability and to match data tokens with the instructions that need them.
The architecture of the MIT tagged-token dataflow computer is shown below. As an example, consider evaluating the expression
a = ((b+1)*c - (d/e))
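A small Python sketch of data-driven (eager) evaluation of this expression may help; the graph encoding, node names, and input values are all assumed for illustration.

```python
# A minimal sketch of data-driven (eager) execution of the example expression
# a = ((b + 1) * c) - (d / e): each node fires as soon as all of its input
# tokens are available, and its result token is forwarded to its consumers.

import operator

# Hypothetical input tokens.
tokens = {"b": 2.0, "c": 3.0, "d": 8.0, "e": 4.0, "one": 1.0}

# Dataflow graph: node -> (operation, input tokens). No program counter is
# needed; readiness is determined purely by operand availability.
graph = {
    "t1": (operator.add, ("b", "one")),    # b + 1
    "t2": (operator.mul, ("t1", "c")),     # (b + 1) * c
    "t3": (operator.truediv, ("d", "e")),  # d / e
    "a":  (operator.sub, ("t2", "t3")),    # final result
}

fired = set()
while len(fired) < len(graph):
    for node, (op, ins) in graph.items():
        if node not in fired and all(i in tokens for i in ins):
            tokens[node] = op(*(tokens[i] for i in ins))  # fire the ready node
            fired.add(node)

print(tokens["a"])   # ((2 + 1) * 3) - (8 / 4) = 7.0
```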
Two reduction models for demand-driven mechanisms are given below.
In a string reduction model, each demander gets a separate copy of the expression for its own evaluation. A long string expression is reduced to a single value in a recursive fashion. An expression is said to be fully reduced when all the arguments have been replaced by literal values.
In a graph reduction model, the expression is represented as a directed graph. The graph is reduced by evaluating branches or subgraphs. Different parts of the graph or subgraphs can be reduced or evaluated in parallel upon demand.