15CS72 ACA Module1 Chapter3FinalCopy

This document discusses principles of scalable performance in parallel computing, focusing on performance metrics such as degree of parallelism, average parallelism, and efficiency. It also covers benchmarks for measuring performance, including MFLOPS, MIPS, and synthetic benchmarks like Dhrystone and Whetstone. Additionally, it explores the applications of massive parallelism in various fields and the characteristics of parallel algorithms, emphasizing the importance of scalability and efficiency in parallel processing.


Module 1: Chapter 3

Principles of Scalable Performance

3.1 Performance metrics and measures

3.1.1 Parallelism profile in programs

Degree of Parallelism

The execution of a program on a parallel computer may use different numbers of processors at different time periods during the execution cycle. For each time period, the number of processors used to execute the program is defined as the degree of parallelism (DOP). It is a discrete time function that assumes only nonnegative integer values.

The plot of the DOP as a function of time is called the parallelism profile of a given program. The profile fluctuates depending upon the algorithmic structure, program optimization, resource utilization and run-time conditions of the computer system. The DOP is defined under the assumption of unlimited resources.

Average Parallelism

Consider a parallel computer with n homogeneous processors, and let m be the maximum parallelism in the profile. In the ideal case n >> m. Let Δ be the computing capacity of a single processor. The total work W (computations performed) is

W = Δ · Σ_{i=1}^{m} i·t_i

where t_i is the total time during which DOP = i, and Σ_{i=1}^{m} t_i = t_2 − t_1 is the total elapsed time.

The average parallelism A is given by

A = (Σ_{i=1}^{m} i·t_i) / (Σ_{i=1}^{m} t_i)

Available Parallelism
Computation that is less numeric than scientific code has relatively little parallelism, even when basic block boundaries are ignored. A basic block is a sequence (block) of instructions in a program with a single entry point and a single exit point. The available parallelism can be increased by compiler optimization and algorithm redesign.
Asymptotic Speedup
The amount of work performed when DOP = i is W_i = i·Δ·t_i.
The execution time of W_i on a single processor is t_i(1) = W_i/Δ.
The execution time of W_i on k processors is t_i(k) = W_i/(k·Δ).
With an infinite number of processors, t_i(∞) = W_i/(i·Δ).
T(1) is the response time of a single-processor system executing the workload, and T(∞) is the response time of executing the same workload when an infinite number of processors is available. The asymptotic speedup S_∞ is defined as the ratio of T(1) to T(∞):

T(1) = Σ_{i=1}^{m} t_i(1) = Σ_{i=1}^{m} W_i/Δ

T(∞) = Σ_{i=1}^{m} t_i(∞) = Σ_{i=1}^{m} W_i/(i·Δ)

S_∞ = T(1)/T(∞) = (Σ_{i=1}^{m} W_i) / (Σ_{i=1}^{m} W_i/i)

In the ideal case, S_∞ = A.
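As an illustration, the short Python sketch below (not from the text; the parallelism profile, the t_i values and Δ are made up) computes the total work W, the average parallelism A and the asymptotic speedup S_∞ for a hypothetical profile, and confirms that A and S_∞ coincide in the ideal, overhead-free case.

# Illustrative sketch: average parallelism and asymptotic speedup from a
# hypothetical parallelism profile mapping DOP value i -> t_i (time at that DOP).

profile = {1: 4.0, 2: 3.0, 4: 2.0, 8: 1.0}   # assumed t_i values (time units)
delta = 1.0                                   # assumed computing capacity of one processor

total_time = sum(profile.values())                           # t2 - t1
total_work = delta * sum(i * t for i, t in profile.items())  # W = Δ Σ i·t_i
A = sum(i * t for i, t in profile.items()) / total_time      # average parallelism

T1    = sum(i * t for i, t in profile.items())   # Σ W_i/Δ with Δ = 1
T_inf = sum(profile.values())                    # Σ W_i/(iΔ) = Σ t_i
S_inf = T1 / T_inf

print(f"W = {total_work}, A = {A:.2f}, S_inf = {S_inf:.2f}")  # A and S_inf coincide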

3.1.2 Mean Performance


Consider a parallel computer with n processors executing m programs. The arithmetic mean and the harmonic mean performance are calculated as follows.

Arithmetic Mean Performance
It is defined as the ratio of the sum of the execution rates of all programs to the total number of programs. The execution rate of each program is denoted R_i and is measured in MIPS or Mflops. Let {R_i} be the execution rates of programs i = 1, 2, ..., m. The arithmetic mean execution rate is defined as

R_a = (1/m) · Σ_{i=1}^{m} R_i

The expression for R_a assumes an equal weight (1/m) for all programs. If the programs are weighted with a distribution π = {f_i | i = 1, 2, ..., m}, we define the weighted arithmetic mean execution rate as

R*_a = Σ_{i=1}^{m} f_i·R_i

However, the arithmetic mean execution rate fails to represent the real time consumed by the benchmarks when they are actually executed, and it is more distorted by anomalous (outlying) rates than the geometric mean.

Harmonic Mean Performance: It measures the average execution time across a large number of programs.

The arithmetic mean execution time per instruction is

T_a = (1/m) · Σ_{i=1}^{m} T_i = (1/m) · Σ_{i=1}^{m} 1/R_i

The harmonic mean execution rate across m benchmark programs is thus defined as

R_h = 1/T_a = m / (Σ_{i=1}^{m} 1/R_i)

If the programs are weighted with the distribution π = {f_i | i = 1, 2, ..., m}, we define the weighted harmonic mean execution rate as

R*_h = 1 / (Σ_{i=1}^{m} f_i/R_i)

Compared to the arithmetic mean, the harmonic mean execution rate is closer to the real performance.
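A minimal sketch, assuming hypothetical benchmark rates R_i and weights f_i, of how the arithmetic and harmonic mean execution rates compare:

# Illustrative sketch: arithmetic vs. harmonic mean execution rates for
# assumed benchmark rates R_i (in Mflops) and program weights f_i.

rates   = [10.0, 100.0, 400.0]      # assumed execution rates R_i
weights = [0.5, 0.3, 0.2]           # assumed weights f_i (sum to 1)

m = len(rates)
R_a      = sum(rates) / m                                    # arithmetic mean
R_a_star = sum(f * R for f, R in zip(weights, rates))        # weighted arithmetic mean
R_h      = m / sum(1.0 / R for R in rates)                   # harmonic mean
R_h_star = 1.0 / sum(f / R for f, R in zip(weights, rates))  # weighted harmonic mean

print(R_a, R_a_star, R_h, R_h_star)
# The harmonic means are pulled toward the slowest program, which is why they
# track the total execution time more faithfully than the arithmetic means.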

Harmonic Mean Speedup

Suppose a workload of multiple programs is executed on an n-processor system. The program (workload) may use different numbers of processors at different times during execution. The program executes in mode i if i processors are used; R_i is the corresponding execution rate, reflecting the collective speed of i processors. The weighted harmonic mean speedup S is defined as the ratio of the sequential execution time T_1 to the weighted arithmetic mean execution time T* across the n execution modes:

S = T_1/T* = 1 / (Σ_{i=1}^{n} f_i/R_i)

Amdahl’s Law
This law can be derived from the harmonic mean speedup under the assumption that the system works in only two modes: fully sequential mode with probability α, or fully parallel mode with probability 1 − α. Amdahl’s speedup S_n then equals the ratio of the number of processors n to 1 + (n − 1)α:

S_n = n / (1 + (n − 1)·α)

This implies that, under the above assumption, the best speedup we can obtain is upper-bounded by 1/α regardless of how many processors the system actually has, because S_n → 1/α as n → ∞. In the figure, S_n is plotted as a function of n for different values of α. Note that the ideal speedup is achieved when α = 0, and the speedup drops sharply as α increases.
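The following sketch (illustrative values of α and n, not from the text) evaluates Amdahl’s speedup formula and shows the 1/α ceiling:

# Sketch of Amdahl's-law speedup S_n = n / (1 + (n-1)*alpha) for assumed
# sequential fractions alpha and machine sizes n.

def amdahl_speedup(n: int, alpha: float) -> float:
    """Speedup of an n-processor system with sequential fraction alpha."""
    return n / (1 + (n - 1) * alpha)

for alpha in (0.0, 0.01, 0.1):
    for n in (4, 64, 1024):
        print(f"alpha={alpha:5.2f}  n={n:5d}  S_n={amdahl_speedup(n, alpha):8.2f}")
# As n grows, S_n approaches 1/alpha: e.g. with alpha = 0.1 the speedup
# saturates near 10 no matter how many processors are added.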

3.1.3 Efficiency, Utilization and Quality

Let O(n) be the total number of unit operations performed by an n-processor system and T(n) the execution time in unit time steps. Assume T(1) = O(1) for a uniprocessor system. The speedup factor is

S(n) = T(1)/T(n)

The system efficiency for an n-processor system is defined as

E(n) = S(n)/n = T(1) / (n·T(n))

The lowest efficiency corresponds to the case where the entire program is executed sequentially on a single processor while the other processors remain idle. The maximum efficiency is achieved when all n processors are fully utilized throughout the execution period. Also, 1 ≤ S(n) ≤ n and 1/n ≤ E(n) ≤ 1.
Redundancy and Utilization
The redundancy of a parallel computation is defined as the ratio of O(n) to O(1). It signifies the extent to which software parallelism matches the hardware parallelism:

R(n) = O(n)/O(1)

The system utilization is defined as

U(n) = R(n)·E(n) = O(n) / (n·T(n))

The system utilization indicates the percentage of resources (processors, memories, etc.) that was kept busy during the execution of a parallel program.
Quality of Parallelism
The quality of parallelism is directly proportional to the speedup and efficiency and
inversely related to the redundancy. Thus, we have
Q(n) = S(n)·E(n) / R(n)
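A small sketch, with made-up operation counts and timings, that ties the five quantities together:

# Illustrative sketch: speedup, efficiency, redundancy, utilization and quality
# of parallelism for assumed operation counts and execution times.

def parallel_quality(O1: float, On: float, T1: float, Tn: float, n: int) -> dict:
    S = T1 / Tn              # speedup S(n)
    E = S / n                # efficiency E(n)
    R = On / O1              # redundancy R(n)
    U = R * E                # utilization U(n)
    Q = S * E / R            # quality of parallelism Q(n)
    return {"S": S, "E": E, "R": R, "U": U, "Q": Q}

# Hypothetical numbers: 4 processors do 20% extra work but cut the time 3x.
print(parallel_quality(O1=1_000_000, On=1_200_000, T1=30.0, Tn=10.0, n=4))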

3.1.4 Benchmarks and Performance Measures


Mflops is short for millions of floating-point operations per second. The Mflops rate is a common measure of the speed of microprocessors when performing floating-point calculations. Another simple measure is the number of instructions that can be executed per second, expressed in millions of instructions per second (MIPS), which indicates integer calculation performance. The MIPS rating depends on the instruction set of the computer and on program behavior. Similarly, the Mflops rating depends on the hardware design and on program behavior. The conventional rating is called the native Mflops rating, which does not distinguish unnormalized from normalized floating-point operations. For example, a real floating-point divide operation may correspond to four normalized floating-point operations. One needs a conversion table between real and normalized floating-point operations to convert a native Mflops rating to a normalized Mflops rating; normalization gives a fairer measure of the total work done.
Dhrystone
Developed by Reinhold Weicker in 1984, Dhrystone is a synthetic benchmark program used to test a processor's integer performance. The unit KDhrystones/s is often used in reporting the results. The Dhrystone benchmark version 1.1 was applied to a number of processors; the DEC VAX 11/780 scored 1.7 KDhrystones/s.

Whetstone
This is a Fortran-based synthetic benchmark assessing floating-point performance, measured in KWhetstones/s. The benchmark includes both integer and floating-point operations involving array indexing, subroutine calls, parameter passing, conditional branching, and trigonometric and transcendental functions. Both Dhrystone and Whetstone are synthetic benchmarks whose results depend heavily on the compilers used, and both have been criticized for being unable to predict the performance of user programs. This sensitivity to compilers is a major drawback of these benchmarks.

The TPS and KLIPS rating


On-line transaction processing applications demand rapid, interactive processing for a large number of relatively simple transactions. They are typically supported by very large databases. Automated teller machines and airline reservation systems are familiar examples. The throughput of computers for on-line transaction processing is often measured in transactions per second (TPS). Each transaction may involve a database search, query answering, and database update operations. Business computers and servers should be designed to deliver a high TPS rate. The TP1 benchmark was originally proposed in 1985 for measuring the transaction processing of business application computers; it also became a standard for gauging relational database performance.
In artificial intelligence applications, the measure KLIPS (kilo logic inferences per second) was used at one time to indicate the reasoning power of an AI machine. For example, the high-speed inference machine developed under Japan's Fifth-Generation Computer System project claimed a performance of 400 KLIPS.

3.2 Parallel Processing Applications

3.2.1 Massive Parallelism for Grand Challenges


Any machine having hundreds or thousands of processors is a massively parallel processing (MPP) system. Some of the grand challenges identified in the U.S. High Performance Computing and Communication (HPCC) program are as follows:
1. Computers are used to study magnetostatic and related interactions in order to reduce noise in the metallic thin films used to coat high-density disks in the magnetic recording industry.
2. High-performance computers are used in rational drug design to identify agents that block the action of the human immunodeficiency virus protease.
3. The design of high-speed transport aircraft is being aided by computational fluid dynamics running on supercomputers.
4. Fuel combustion can be made more efficient by designing better engine models through chemical kinetics calculations.
5. Catalysts for the chemical reactions involved in many biological processes are being designed with computers.
6. Ocean modeling cannot be accurate without supercomputing MPP systems. Ozone depletion and climate research also demand computers for analyzing the complex thermal, chemical and fluid dynamic mechanisms involved.
The figure shows the computing requirements of different applications over time, in terms of the processing speed (along the x-axis) and memory capacity (along the y-axis) required for scientific simulation and modeling, advanced computer-aided design (CAD), and real-time processing of large-scale database and information retrieval operations.

Exploiting Massive Parallelism


Instruction-level parallelism is the execution of multiple independent instructions in the same clock cycle. Instruction parallelism is often constrained by program behavior, compiler/OS limitations, and the program flow and execution mechanisms built into modern computers.

On the other hand, data parallelism is much higher than instruction parallelism. Data parallelism is parallelization across multiple processors: the data is distributed across different nodes, which operate on the data in parallel. Data parallelism refers to a situation where the same operation (instruction or program) executes over a large array of data, as in the sketch below. Data parallelism has been implemented on pipelined vector processors, SIMD array processors, and SPMD or MPMD multicomputer systems.
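A minimal data-parallel sketch in Python (illustrative only; the worker count, data and operation are assumptions), in which the same operation is applied to every element of an array by a pool of worker processes:

# Data-parallel sketch: one operation, partitioned data, several workers.
from multiprocessing import Pool

def square(x: int) -> int:        # the single operation applied to every element
    return x * x

if __name__ == "__main__":
    data = list(range(1_000))
    with Pool(processes=4) as pool:        # 4 worker processes stand in for 4 nodes
        result = pool.map(square, data)    # the data is partitioned across workers
    print(result[:5])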

Some Early Representative Massively Parallel Processing Systems


MPP system development started around 1968. Some of the systems developed are listed below.
Illiac IV: 64 PEs under one controller.
MPP: built by Goodyear, with 16,384 PEs.
GF11: built by IBM, with 576 PEs.
An example of an MPP system operating in MIMD mode is the BBN TC-2000, with a maximum configuration of 512 processors.

3.2.2 Application Models of Parallel Computers.

Efficiency Curves
If the workload is kept constant, the efficiency decreases rapidly as the machine size grows, as shown in the figure. Hence it is necessary to increase the machine size and the problem size proportionally; such a system is known as a scalable computer for solving scalable problems. The constant-workload efficiency curve is denoted α and corresponds to Amdahl's law.
If the workload grows linearly with the machine size (the ideal case), the efficiency curve, denoted γ, is almost flat.
If linear workload growth cannot be achieved, the second choice is sublinear scalability as close to linearity as possible, illustrated by curve β in the figure. The sublinear efficiency curve β lies somewhere between curves α and γ.
If the workload has an exponential growth pattern, the system is poorly scalable: to keep a constant efficiency or a good speedup, the increase in workload with problem size becomes explosive and exceeds the memory or I/O limits. Thus the efficiency curve θ is achievable only with exponentially increased memory capacity.
Application Models: Three speedup performance models are defined below; they are bounded by limited memory, limited tolerance for IPC latency, and limited I/O bandwidth.

Fixed-Load Model: It corresponds to a constant workload, with efficiency curve α. It is limited by the communication bound shown as the shaded area in the figure.
Fixed-Time Model: It corresponds to a linear workload, with efficiency curve γ.
Fixed-Memory Model: It corresponds to a workload lying between curves γ and θ. It is limited by the memory bound shown as the shaded area in the figure.

Scalability Analysis
Scalability analysis determines whether parallel processing of a given problem can offer the desired improvement in performance. The analysis should help guide the design of a massively parallel processor.
Some tradeoffs in scalability analysis:
1. Computer cost and programming overhead should be considered in scalability analysis.
2. A multi-user environment, in which multiple programs execute concurrently by sharing the available resources, should also be considered.

3.2.3 Scalability of Parallel Algorithms

Parallel algorithms are specially designed for parallel computers. The characteristics of parallel algorithms are listed below.
1. Deterministic versus non-deterministic: with a deterministic algorithm, for a given input the machine always produces the same output, whereas with a non-deterministic algorithm the same input may produce different outputs in different runs. Deterministic algorithms with polynomial time complexity are the ones implementable on real machines.
2. Computational granularity: granularity decides the size of the data items and program modules used in the computation. In this sense, we classify algorithms as fine-grain, medium-grain, or coarse-grain. In fine-grained parallelism, a program is broken down into a large number of small tasks. In coarse-grained parallelism, a program is split into large tasks. Medium-grained parallelism is a compromise between the two, with task sizes and communication times greater than in fine-grained parallelism and smaller than in coarse-grained parallelism. Most general-purpose parallel computers fall into this category.
3. Parallelism profile: the effectiveness of a parallel algorithm is determined by the distribution of its degree of parallelism.
4. Communication patterns and synchronization requirements: communication patterns address both memory access and interprocessor communication. The patterns can be static or dynamic. Static algorithms are more suitable for SIMD or pipelined machines, while dynamic algorithms are for MIMD machines. The synchronization frequency also affects the efficiency of an algorithm.
5. Uniformity of operations: the same (uniform) operation is performed on the data set, and the data is partitioned across parallel nodes for data-level parallelism. This is suitable for SIMD processing or pipelining.
6. Memory requirement and data structures: memory efficiency is affected by the data structures chosen and the data movement patterns in the algorithm. The efficiency of a parallel algorithm can be assessed by analyzing its space and time complexity.

Isoefficiency

Let w be the workload and s the problem size, so w = w(s). The machine size is n. Let h be the communication overhead of the parallel algorithm; it is a function of both the machine size n and the problem size s:

h = h(s, n)

The efficiency of a parallel algorithm implemented on a parallel computer is

E = w(s) / (w(s) + h(s, n))

A small isoefficiency function means that small increments in the problem size are sufficient for efficient utilization of an increasing number of processing elements, indicating that the parallel system is highly scalable. A large isoefficiency function indicates a poorly scalable parallel system.
In general, the overhead h increases with increasing values of n and s. The isoefficiency concept requires the efficiency to remain constant when a parallel algorithm is implemented on larger parallel computers.

To maintain a constant efficiency E, the workload w(s) must grow in proportion to the overhead h(s, n):

w(s) = (E / (1 − E)) · h(s, n)

The factor C = E / (1 − E) is constant for a fixed efficiency E. Thus we define the isoefficiency function, which is used for scalability analysis, as

f_E(n) = C · h(s, n)

The isoefficiency functions of common parallel algorithms are polynomial functions of n, i.e. they are O(n^k) for some k ≥ 1. The smaller the power of n, the more scalable the parallel system.
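A small sketch of how the isoefficiency function can be used, assuming a purely hypothetical overhead model h(s, n) = n·log2(n)·s:

# Sketch of an isoefficiency calculation under an assumed overhead model.
# For a target efficiency E, the workload must grow as w(s) = C * h(s, n)
# with C = E / (1 - E).
import math

def required_workload(s: float, n: int, E: float) -> float:
    C = E / (1.0 - E)
    h = n * math.log2(n) * s          # assumed overhead function h(s, n)
    return C * h                      # workload needed to hold efficiency at E

for n in (4, 16, 64, 256):
    print(n, required_workload(s=1.0, n=n, E=0.8))
# The faster this value grows with n, the less scalable the algorithm/machine pair.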

The isoefficiency functions of four matrix multiplication algorithms are given below.


3.3 Speedup Performance Laws
Three speedup performance models are described below
3.3.1 Amdahl’s law for fixed workload
Time-critical applications, where the goal is to minimize the turnaround time, provided the major motivation for the fixed-load speedup model and Amdahl's law.
Here we consider a fixed workload; as the number of processors in the parallel computer increases, the workload is distributed over more processors for parallel execution. Consider the case DOP = i with i ≥ n, where n is the number of processors (machine size). Assume the workload W_i is executed by the n processors, m is the maximum parallelism observed, and Δ is the computing capacity of each processor.
The execution time of W_i on n processors is

t_i(n) = (W_i / (i·Δ)) · ⌈i/n⌉

The response time is

T(n) = Σ_{i=1}^{m} (W_i / (i·Δ)) · ⌈i/n⌉

Now we define the fixed-load speedup factor S_n as the ratio of T(1) to T(n), where T(1) is the response time on a uniprocessor system:

T(1) = Σ_{i=1}^{m} W_i / Δ

S_n = T(1)/T(n) = (Σ_{i=1}^{m} W_i) / (Σ_{i=1}^{m} (W_i/i)·⌈i/n⌉)

Let Q(n) be the overhead contributed by interprocessor communication, memory access latencies, etc. We can then rewrite S_n as

S_n = (Σ_{i=1}^{m} W_i) / (Σ_{i=1}^{m} (W_i/i)·⌈i/n⌉ + Q(n))

Q(n) depends on the application as well as on the machine.

Consider a situation where the system can operate in only two modes, i.e. sequential mode with DOP = 1 and perfectly parallel mode with DOP = n. Then S_n reduces to

S_n = (W_1 + W_n) / (W_1 + W_n/n)

The execution time of the parallel portion of the program is reduced when it is executed by n processors, but the sequential portion does not change and requires a constant amount of time, as shown in the figure below. With the workload normalized,

W_1 + W_n = α + (1 − α) = 1

so that S_n = n / (1 + (n − 1)·α). Here α represents the fraction of the program that must be executed sequentially, and 1 − α corresponds to the portion of the code that can be executed in parallel. The total workload W_1 + W_n is kept constant.
The speedup curve falls very rapidly as α increases. This means that even with a small percentage of sequential code the overall speedup cannot exceed 1/α. This α is the sequential bottleneck of the program, which cannot be removed by increasing the number of processors.
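The sketch below (hypothetical workload profile W_i and overhead Q(n)) evaluates the fixed-load speedup expression above for several machine sizes:

# Sketch: fixed-load speedup S_n for an assumed workload profile {W_i} and
# overhead Q(n), following S_n = (Σ W_i) / (Σ (W_i/i)·ceil(i/n) + Q(n)).
from math import ceil

W = {1: 10.0, 4: 40.0, 16: 160.0}     # assumed workload W_i per DOP value i

def fixed_load_speedup(n: int, Q: float = 0.0) -> float:
    T1 = sum(W.values())                                   # time on 1 processor (Δ = 1)
    Tn = sum(w / i * ceil(i / n) for i, w in W.items())    # time on n processors
    return T1 / (Tn + Q)

for n in (1, 2, 4, 8, 16):
    print(n, round(fixed_load_speedup(n, Q=2.0), 2))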

3.3.2 Gustafson’s Law for Scaled Problems


Here the workload is not constant. The problem size is increased as the machine size is upgraded, so that the additional computing power is used to obtain a more accurate solution while the execution time is kept unchanged.
The main advantage is not in saving time but in producing a much more accurate solution. This problem scaling for accuracy motivated Gustafson to develop the fixed-time speedup model. The scaled problem keeps all of the increased resources busy, resulting in a better system utilization ratio.

Fixed-Time Speedup

Let m′ be the maximum DOP with respect to the scaled problem and W_i′ the scaled workload with DOP = i.

The fixed-time speedup is defined under the assumption that T(1) = T′(n), i.e. the scaled problem takes the same time on n processors as the original problem takes on one. The general formula for fixed-time speedup is

S_n′ = (Σ_{i=1}^{m′} W_i′) / (Σ_{i=1}^{m} W_i)

where, from the fixed-time condition, Σ_{i=1}^{m} W_i = Σ_{i=1}^{m′} (W_i′/i)·⌈i/n⌉ + Q(n).

Gustafson’s Law

In the special two-mode case, with the workload normalized so that W_1 + W_n = α + (1 − α) = 1 and the parallel part scaled to n·W_n, Gustafson’s law can be restated in terms of α as

S_n′ = (W_1 + n·W_n) / (W_1 + W_n) = α + n·(1 − α)

As shown in the figure below, the workload increases with the number of processors while the execution time remains unchanged. Note that the slope of the S_n′ curve is much flatter than in the fixed-load case. This implies that Gustafson’s law does support scalable performance as the machine size increases: the idea is to keep all processors busy by increasing the problem size. When the problem can scale to match the available computing power, the sequential fraction is no longer a bottleneck.
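A short sketch of Gustafson’s fixed-time speedup for assumed values of α, to contrast with the Amdahl sketch earlier:

# Sketch of Gustafson's fixed-time speedup S'_n = alpha + n*(1 - alpha)
# for assumed sequential fractions alpha.

def gustafson_speedup(n: int, alpha: float) -> float:
    return alpha + n * (1.0 - alpha)

for alpha in (0.01, 0.1, 0.5):
    print(alpha, [round(gustafson_speedup(n, alpha), 1) for n in (8, 64, 512)])
# Unlike Amdahl's curve, S'_n keeps growing almost linearly in n because the
# parallel part of the workload is scaled up with the machine size.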
3.3.3 Memory-Bounded Speedup Model
Some scientific and engineering applications require a large memory space. The idea here is to solve the largest problem that fits in the available memory capacity. This model may require an increase in execution time to achieve scalable performance. The fixed-memory model assumes a scaled workload and allows the execution time to increase; the increase in workload (problem size) is memory-bound.
Fixed-Memory Speedup
Let M be the memory requirement of a given problem and W the computational workload: W = g(M), or equivalently M = g⁻¹(W).

Consider that the system operates in two modes, i.e. sequential and perfectly parallel mode. The memory-bounded speedup is then given by

S_n* = (W_1 + G(n)·W_n) / (W_1 + G(n)·W_n/n)

Here W_n* = G(n)·W_n is the scaled workload, with W_n* = g*(nM), where nM is the increased memory capacity of an n-node multicomputer. The factor G(n) reflects the increase in workload as the memory increases n times:

g*(nM) = G(n)·g(M) = G(n)·W_n

The above speedup equation covers the following three cases.
Case 1: G(n) = 1. This corresponds to the case where the problem size is fixed.
Case 2: G(n) = n. This applies to the case where the workload increases n times when the memory is increased n times.
Case 3: G(n) > n. This corresponds to the situation where the computational workload increases faster than the memory requirement.
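The sketch below (with an assumed normalized workload split W_1, W_n) compares the three cases of G(n) in the two-mode memory-bounded speedup formula above:

# Sketch of the memory-bounded (fixed-memory) speedup in the two-mode case,
#   S*_n = (W_1 + G(n)*W_n) / (W_1 + G(n)*W_n / n),
# comparing the three cases of G(n) discussed above (values are illustrative).

def memory_bounded_speedup(n: int, W1: float, Wn: float, G: float) -> float:
    return (W1 + G * Wn) / (W1 + G * Wn / n)

n, W1, Wn = 64, 0.1, 0.9          # assumed normalized workload split
for label, G in (("G(n)=1", 1), ("G(n)=n", n), ("G(n)>n", 2 * n)):
    print(label, round(memory_bounded_speedup(n, W1, Wn, G), 2))
# Larger G(n) pushes the speedup closer to n, at the cost of a longer execution time.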

3.4 Scalability Analysis and Approaches


The simplest definition of scalability is that the performance of a computer system should increase linearly with the number of processors used for a given application. Scalability analysis of a given computer system must be conducted for a given application program or algorithm. The analysis can be performed under different constraints on the growth of the problem size (workload) and of the machine size (number of processors).

3.4.1 Scalability Metrics and Goals


Identified below are the basic metrics or factors affecting the scalability of a computer system for a given application:
1. Machine size (n): the number of processors in the parallel computer system. A larger machine size implies more computing power and more resources.
2. Clock rate (f): the clock rate is the reciprocal of the clock cycle time. The clock rate should scale with improvements in technology.
3. CPU time (T): the execution time, denoted T(s, n), where s is the problem size and n is the number of processors in the parallel machine.
4. I/O demand (d): the I/O requests for moving data, programs and results can overlap with CPU operations in a multiprogrammed environment.
5. Memory capacity (m): the memory requirement changes during program execution depending on the problem size, the algorithm, etc. Physical memory is limited, but virtual memory is almost unlimited.
6. Communication overhead (h): the latency due to interprocessor communication, synchronization and remote memory access, denoted h(s, n), where s is the problem size and n is the number of processors.
7. Computer cost (c): the total cost of hardware and software resources.
8. Programming overhead (p): programming overhead may slow down software productivity and thus implies a high cost.

Speedup and Efficiency Revisited

T(s, 1) is the sequential execution time on a uniprocessor system and T(s, n) the execution time on a parallel machine with n processors. The asymptotic speedup S(s, n) is defined as

S(s, n) = T(s, 1) / (T(s, n) + h(s, n))

where h(s, n) is the communication overhead plus the I/O overhead. T(s, n) can be reduced by increasing the number of processors.
The system efficiency is given by

E(s, n) = S(s, n) / n

The best possible efficiency is one, implying that the best speedup is linear, i.e. S(s, n) = n. A system is scalable if the system efficiency E(s, n) = 1 for all algorithms, for any number of processors n and any problem size s.
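A minimal sketch, with assumed timing and overhead functions, of S(s, n) and E(s, n) as defined above:

# Sketch: speedup and efficiency with an explicit overhead term, using the
# definitions above and hypothetical timing/overhead values for one problem size s.

def speedup(T1: float, Tn: float, h: float) -> float:
    return T1 / (Tn + h)                   # S(s,n) = T(s,1) / (T(s,n) + h(s,n))

def efficiency(T1: float, Tn: float, h: float, n: int) -> float:
    return speedup(T1, Tn, h) / n          # E(s,n) = S(s,n) / n

T1 = 1000.0                                # assumed sequential time T(s,1)
for n in (8, 32, 128):
    Tn = T1 / n                            # assumed ideal parallel compute time
    h  = 5.0 * n                           # assumed overhead h(s,n) growing with n
    print(n, round(speedup(T1, Tn, h), 2), round(efficiency(T1, Tn, h, n), 3))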

2.3 Program Flow Mechanisms


There are three program flow mechanisms, described below.
Control Flow versus Data Flow
In control flow computers, instructions are executed in the order stated explicitly in the program. The program counter (PC) keeps track of the next instruction to be executed, and the program flow is controlled explicitly by the programmer.

In data flow computers, an instruction is executed when its data (operands) become available; data tokens are passed directly between instructions. The data generated by an instruction is duplicated into multiple copies and forwarded to all instructions that need it. A data token, once consumed by an instruction, is no longer available for reuse. A data flow computer requires no PC or control sequencer, but it needs mechanisms to detect data (operand) availability and to match data tokens with the instructions that need them, as in the sketch below.
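The toy sketch below is only an illustration of the data-driven firing rule (it is not the MIT design): an instruction fires as soon as all of its operand tokens have arrived, and its result token is duplicated to every consumer.

# Toy data-driven execution sketch: instructions fire when all operand tokens
# have arrived; results are copied to every consumer instruction.
from dataclasses import dataclass, field

@dataclass
class Instr:
    op: str                                          # "+" or "*"
    need: int                                        # number of operands required
    consumers: list = field(default_factory=list)    # downstream instruction names
    tokens: list = field(default_factory=list)       # operand tokens received so far

def run(instrs: dict, initial_tokens: list):
    ready = list(initial_tokens)                     # (target instruction, value) pairs
    while ready:
        name, value = ready.pop()
        ins = instrs[name]
        ins.tokens.append(value)
        if len(ins.tokens) == ins.need:              # firing rule: all operands present
            result = {"+": sum, "*": lambda v: v[0] * v[1]}[ins.op](ins.tokens)
            for c in ins.consumers:                  # duplicate the result token
                ready.append((c, result))
            if not ins.consumers:
                print(name, "=", result)

# a = (b + 1) * c  with b = 4, c = 3
instrs = {"add": Instr("+", 2, ["mul"]), "mul": Instr("*", 2, [])}
run(instrs, [("add", 4), ("add", 1), ("mul", 3)])    # prints: mul = 15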
The architecture of the MIT tagged-token dataflow computer is shown below.

There are n processing elements (PEs) interconnected by an n × n routing network. Within each PE, a low-level token-matching mechanism dispatches only those instructions whose data (operands) are available in the program memory. Each datum is tagged with the address of the instruction to which it belongs and the context in which the instruction is being executed. Tokens can also be passed to other PEs through the routing network. All internal token circulation operations are pipelined without blocking. Another synchronization mechanism, called the I-structure, is provided within each PE. The I-structure is a tagged memory unit for overlapped usage of a data structure by both the producer and consumer processes.

Demand Driven Mechanisms

Demand-driven computation corresponds to lazy evaluation, because operations are executed only when their results are required by another instruction. Consider the following expression:

a = ((b + 1) * c - (d / e))

Two reduction models for demand-driven computation are given below.
In a string reduction model, each demander gets a separate copy of the expression for its own evaluation. A long string expression is reduced to a single value in a recursive fashion. An expression is said to be fully reduced when all of its arguments have been replaced by literal values.
In a graph reduction model, the expression is represented as a directed graph. The graph is reduced by evaluating branches or subgraphs. Different parts of the graph (subgraphs) can be reduced or evaluated in parallel upon demand.
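A minimal sketch of demand-driven (lazy) evaluation of the expression above, using Python thunks with assumed literal values for b, c, d and e; each subexpression is evaluated only when its result is demanded:

# Lazy (demand-driven) evaluation sketch of a = ((b + 1) * c - (d / e)).
b, c, d, e = 4, 3, 10, 2      # assumed literal values

plus  = lambda: b + 1          # nothing is computed yet; only a demand triggers work
times = lambda: plus() * c
div   = lambda: d / e
a     = lambda: times() - div()

print(a())    # demanding a forces the whole reduction: ((4 + 1) * 3) - (10 / 2) = 10.0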

2.3.3 Comparison of Flow Mechanisms
