15CS72 ACA Module1 Chapter3FinalCopy

This document discusses principles of scalable performance in parallel computing, focusing on performance metrics such as degree of parallelism, average parallelism, and efficiency. It also covers benchmarks for measuring performance, including MFLOPS, MIPS, and synthetic benchmarks like Dhrystone and Whetstone. Additionally, it explores the applications of massive parallelism in various fields and the characteristics of parallel algorithms, emphasizing the importance of scalability and efficiency in parallel processing.


Module 1: Chapter 3

Principles of Scalable Performance

3.1 Performance metrics and measures

3.1.1 Parallelism profile in programs

Degree of Parallelism

The execution of a program on a parallel computer may use different numbers of processors at different time periods during the execution cycle. For each time period, the number of processors used to execute the program is defined as the degree of parallelism (DOP). It is a discrete time function that assumes only nonnegative integer values.

The plot of the DOP as a function of time is called the parallelism profile of a given program. The profile fluctuates depending upon the algorithmic structure, program optimization, resource utilization and run-time conditions of the computer system. The DOP is defined under the assumption of unlimited resources.

Average Parallelism

Consider a parallel computer with n homogeneous processors, and let m be the maximum parallelism in the profile. In the ideal case n >> m. Let Δ be the computing capacity of a single processor. The total work W (computations performed) is

W = Δ · Σ_{i=1}^{m} i·t_i

where t_i is the total time during which DOP = i, and Σ_{i=1}^{m} t_i = t_2 − t_1 is the total elapsed time.

The average parallelism A is given by

A = (Σ_{i=1}^{m} i·t_i) / (Σ_{i=1}^{m} t_i)

Available Parallelism
Computation that is less numeric than scientific code has relatively little parallelism, even when basic block boundaries are ignored. A basic block is a sequence (block) of instructions in a program with a single entry point and a single exit point. The available parallelism can be increased by compiler optimization and algorithm redesign.
Asymptotic Speedup
The amount of work performed when DOP = i is W_i = i·Δ·t_i.
The execution time of W_i on a single processor is t_i(1) = W_i/Δ.
The execution time of W_i on k processors is t_i(k) = W_i/(k·Δ).
With an infinite number of processors, t_i(∞) = W_i/(i·Δ).
T(1) is the response time of a single-processor system executing the workload, and T(∞) is the response time of executing the same workload when an infinite number of processors is available. The asymptotic speedup S_∞ is defined as the ratio of T(1) to T(∞):

T(1) = Σ_{i=1}^{m} t_i(1) = Σ_{i=1}^{m} W_i/Δ

T(∞) = Σ_{i=1}^{m} t_i(∞) = Σ_{i=1}^{m} W_i/(i·Δ)

S_∞ = T(1)/T(∞) = (Σ_{i=1}^{m} W_i) / (Σ_{i=1}^{m} W_i/i)

In the ideal case, S_∞ = A.
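As an illustration, the short Python sketch below (not from the text; the parallelism profile, the t_i values and Δ are made up) computes the total work W, the average parallelism A and the asymptotic speedup S_∞ for a hypothetical profile, and confirms that A and S_∞ coincide in the ideal, overhead-free case.

# Illustrative sketch: average parallelism and asymptotic speedup from a
# hypothetical parallelism profile mapping DOP value i -> t_i (time at that DOP).

profile = {1: 4.0, 2: 3.0, 4: 2.0, 8: 1.0}   # assumed t_i values (time units)
delta = 1.0                                   # assumed computing capacity of one processor

total_time = sum(profile.values())                           # t2 - t1
total_work = delta * sum(i * t for i, t in profile.items())  # W = Δ Σ i·t_i
A = sum(i * t for i, t in profile.items()) / total_time      # average parallelism

T1    = sum(i * t for i, t in profile.items())   # Σ W_i/Δ with Δ = 1
T_inf = sum(profile.values())                    # Σ W_i/(iΔ) = Σ t_i
S_inf = T1 / T_inf

print(f"W = {total_work}, A = {A:.2f}, S_inf = {S_inf:.2f}")  # A and S_inf coincide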

3.1.2 Mean Performance


Consider a parallel computer with n processors executing m programs. The arithmetic mean and the harmonic mean performance are calculated as follows.

Arithmetic Mean Performance
It is defined as the ratio of the sum of the execution rates of all programs to the total number of programs. The execution rate of each program is denoted R_i and is measured in MIPS or Mflops. Let {R_i} be the execution rates of programs i = 1, 2, ..., m. The arithmetic mean execution rate is defined as

R_a = (1/m) · Σ_{i=1}^{m} R_i

The expression for R_a assumes an equal weight (1/m) for all programs. If the programs are weighted with a distribution π = {f_i | i = 1, 2, ..., m}, we define the weighted arithmetic mean execution rate as

R*_a = Σ_{i=1}^{m} f_i·R_i

However, the arithmetic mean execution rate fails to represent the real time consumed by the benchmarks when they are actually executed, and it is more distorted by anomalous (outlying) rates than the geometric mean.

Harmonic Mean Performance: It measures the average execution time across a large number of programs.

The arithmetic mean execution time per instruction is

T_a = (1/m) · Σ_{i=1}^{m} T_i = (1/m) · Σ_{i=1}^{m} 1/R_i

The harmonic mean execution rate across m benchmark programs is thus defined as

R_h = 1/T_a = m / (Σ_{i=1}^{m} 1/R_i)

If the programs are weighted with the distribution π = {f_i | i = 1, 2, ..., m}, we define the weighted harmonic mean execution rate as

R*_h = 1 / (Σ_{i=1}^{m} f_i/R_i)

Compared to the arithmetic mean, the harmonic mean execution rate is closer to the real performance.
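A minimal sketch, assuming hypothetical benchmark rates R_i and weights f_i, of how the arithmetic and harmonic mean execution rates compare:

# Illustrative sketch: arithmetic vs. harmonic mean execution rates for
# assumed benchmark rates R_i (in Mflops) and program weights f_i.

rates   = [10.0, 100.0, 400.0]      # assumed execution rates R_i
weights = [0.5, 0.3, 0.2]           # assumed weights f_i (sum to 1)

m = len(rates)
R_a      = sum(rates) / m                                    # arithmetic mean
R_a_star = sum(f * R for f, R in zip(weights, rates))        # weighted arithmetic mean
R_h      = m / sum(1.0 / R for R in rates)                   # harmonic mean
R_h_star = 1.0 / sum(f / R for f, R in zip(weights, rates))  # weighted harmonic mean

print(R_a, R_a_star, R_h, R_h_star)
# The harmonic means are pulled toward the slowest program, which is why they
# track the total execution time more faithfully than the arithmetic means.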

Harmonic Mean Speedup

Suppose a workload of multiple programs is executed on an n-processor system. The program (workload) may use different numbers of processors at different times during execution. The program executes in mode i if i processors are used; R_i is the corresponding execution rate, reflecting the collective speed of i processors. The weighted harmonic mean speedup S is defined as the ratio of the sequential execution time T_1 to the weighted arithmetic mean execution time T* across the n execution modes:

S = T_1/T* = 1 / (Σ_{i=1}^{n} f_i/R_i)

Amdahl’s Law
This law can be derived from the harmonic mean speedup under the assumption that the system works in only two modes: fully sequential mode with probability α, or fully parallel mode with probability 1 − α. Amdahl’s speedup S_n then equals the ratio of the number of processors n to 1 + (n − 1)α:

S_n = n / (1 + (n − 1)·α)

This implies that, under the above assumption, the best speedup we can obtain is upper-bounded by 1/α regardless of how many processors the system actually has, because S_n → 1/α as n → ∞. In the figure, S_n is plotted as a function of n for different values of α. Note that the ideal speedup is achieved when α = 0, and the speedup drops sharply as α increases.
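The following sketch (illustrative values of α and n, not from the text) evaluates Amdahl’s speedup formula and shows the 1/α ceiling:

# Sketch of Amdahl's-law speedup S_n = n / (1 + (n-1)*alpha) for assumed
# sequential fractions alpha and machine sizes n.

def amdahl_speedup(n: int, alpha: float) -> float:
    """Speedup of an n-processor system with sequential fraction alpha."""
    return n / (1 + (n - 1) * alpha)

for alpha in (0.0, 0.01, 0.1):
    for n in (4, 64, 1024):
        print(f"alpha={alpha:5.2f}  n={n:5d}  S_n={amdahl_speedup(n, alpha):8.2f}")
# As n grows, S_n approaches 1/alpha: e.g. with alpha = 0.1 the speedup
# saturates near 10 no matter how many processors are added.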

3.1.3 Efficiency, Utilization and Quality

Let O(n) be the total number of unit operations performed by an n-processor system and T(n) the execution time in unit time steps. Assume T(1) = O(1) for a uniprocessor system. The speedup factor is

S(n) = T(1)/T(n)

The system efficiency for an n-processor system is defined as

E(n) = S(n)/n = T(1) / (n·T(n))

The lowest efficiency corresponds to the case where the entire program is executed sequentially on a single processor while the other processors remain idle. The maximum efficiency is achieved when all n processors are fully utilized throughout the execution period. Also, 1 ≤ S(n) ≤ n and 1/n ≤ E(n) ≤ 1.
Redundancy and Utilization
The redundancy of a parallel computation is defined as the ratio of O(n) to O(1). It signifies the extent to which software parallelism matches the hardware parallelism:

R(n) = O(n)/O(1)

The system utilization is defined as

U(n) = R(n)·E(n) = O(n) / (n·T(n))

The system utilization indicates the percentage of resources (processors, memories, etc.) that was kept busy during the execution of a parallel program.
Quality of Parallelism
The quality of parallelism is directly proportional to the speedup and efficiency and
inversely related to the redundancy. Thus, we have
Q(n) = S(n)·E(n) / R(n)
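A small sketch, with made-up operation counts and timings, that ties the five quantities together:

# Illustrative sketch: speedup, efficiency, redundancy, utilization and quality
# of parallelism for assumed operation counts and execution times.

def parallel_quality(O1: float, On: float, T1: float, Tn: float, n: int) -> dict:
    S = T1 / Tn              # speedup S(n)
    E = S / n                # efficiency E(n)
    R = On / O1              # redundancy R(n)
    U = R * E                # utilization U(n)
    Q = S * E / R            # quality of parallelism Q(n)
    return {"S": S, "E": E, "R": R, "U": U, "Q": Q}

# Hypothetical numbers: 4 processors do 20% extra work but cut the time 3x.
print(parallel_quality(O1=1_000_000, On=1_200_000, T1=30.0, Tn=10.0, n=4))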

3.1.4 Benchmarks and Performance Measures


Mflops is short for millions of floating-point operations per second. The Mflops rate is a common measure of the speed of microprocessors when performing floating-point calculations. Another simple measure is the number of instructions that can be executed per second, expressed in millions of instructions per second (MIPS), which indicates integer calculation performance. The MIPS rating depends on the instruction set of the computer and on program behavior. Similarly, the Mflops rating depends on the hardware design and on program behavior. The conventional rating is called the native Mflops rating, which does not distinguish unnormalized from normalized floating-point operations. For example, a real floating-point divide operation may correspond to four normalized floating-point operations. One needs a conversion table between real and normalized floating-point operations to convert a native Mflops rating to a normalized Mflops rating; normalization gives a fairer measure of the total work done.
Dhrystone
Developed by Reinhold Weicker in 1984, Dhrystone is a synthetic benchmark program used to test a processor's integer performance. The unit KDhrystones/s is often used in reporting the results. The Dhrystone benchmark version 1.1 was applied to a number of processors; the DEC VAX 11/780 scored 1.7 KDhrystones/s.

Whetstone
This is a Fortran-based synthetic benchmark assessing floating-point performance, measured in KWhetstones/s. The benchmark includes both integer and floating-point operations involving array indexing, subroutine calls, parameter passing, conditional branching, and trigonometric and transcendental functions. Both Dhrystone and Whetstone are synthetic benchmarks whose results depend heavily on the compilers used, and both have been criticized for being unable to predict the performance of user programs. This sensitivity to compilers is a major drawback of these benchmarks.

The TPS and KLIPS rating


On-line transaction processing applications demand rapid, interactive processing for a large number of relatively simple transactions. They are typically supported by very large databases. Automated teller machines and airline reservation systems are familiar examples. The throughput of computers for on-line transaction processing is often measured in transactions per second (TPS). Each transaction may involve a database search, query answering, and database update operations. Business computers and servers should be designed to deliver a high TPS rate. The TP1 benchmark was originally proposed in 1985 for measuring the transaction processing of business application computers; it also became a standard for gauging relational database performance.
In artificial intelligence applications, the measure KLIPS (kilo logic inferences per second) was used at one time to indicate the reasoning power of an AI machine. For example, the high-speed inference machine developed under Japan's Fifth-Generation Computer System project claimed a performance of 400 KLIPS.

3.2 Parallel Processing Applications

3.2.1 Massive Parallelism for Grand Challenges


Any machine having hundreds or thousands of processors is a massively parallel processing (MPP) system. Some of the grand challenges identified in the U.S. High Performance Computing and Communication (HPCC) program are as follows:
1. Computers are used to study magnetostatic and related interactions in order to reduce noise in the metallic thin films used to coat high-density disks in the magnetic recording industry.
2. High-performance computers are used in rational drug design to identify agents that block the action of the human immunodeficiency virus protease.
3. The design of high-speed transport aircraft is being aided by computational fluid dynamics running on supercomputers.
4. Fuel combustion can be made more efficient by designing better engine models through chemical kinetics calculations.
5. Catalysts for the chemical reactions involved in many biological processes are being designed with computers.
6. Ocean modeling cannot be accurate without supercomputing MPP systems. Ozone depletion and climate research also demand computers for analyzing the complex thermal, chemical and fluid dynamic mechanisms involved.
The figure shows the computing requirements of different applications over time, in terms of the processing speed (along the x-axis) and memory capacity (along the y-axis) required for scientific simulation and modeling, advanced computer-aided design (CAD), and real-time processing of large-scale database and information retrieval operations.

Exploiting Massive Parallelism


Instruction-level parallelism is the execution of multiple independent instructions in the same clock cycle. Instruction parallelism is often constrained by program behavior, compiler/OS limitations, and the program flow and execution mechanisms built into modern computers.

On the other hand, data parallelism is much higher than instruction parallelism. Data parallelism is parallelization across multiple processors: the data is distributed across different nodes, which operate on the data in parallel. Data parallelism refers to a situation where the same operation (instruction or program) executes over a large array of data, as in the sketch below. Data parallelism has been implemented on pipelined vector processors, SIMD array processors, and SPMD or MPMD multicomputer systems.
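A minimal data-parallel sketch in Python (illustrative only; the worker count, data and operation are assumptions), in which the same operation is applied to every element of an array by a pool of worker processes:

# Data-parallel sketch: one operation, partitioned data, several workers.
from multiprocessing import Pool

def square(x: int) -> int:        # the single operation applied to every element
    return x * x

if __name__ == "__main__":
    data = list(range(1_000))
    with Pool(processes=4) as pool:        # 4 worker processes stand in for 4 nodes
        result = pool.map(square, data)    # the data is partitioned across workers
    print(result[:5])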

Some Early Representative Massively Parallel Processing Systems


MPP system development started around 1968. Some of the systems developed are listed below.
Illiac IV: 64 PEs under one controller.
MPP: built by Goodyear, with 16,384 PEs.
GF11: built by IBM, with 576 PEs.
An example of an MPP system operating in MIMD mode is the BBN TC-2000, with a maximum configuration of 512 processors.

3.2.2 Application Models of Parallel Computers.

Efficiency Curves
If the workload is kept constant, the efficiency decreases rapidly as the machine size grows, as shown in the figure. Hence it is necessary to increase the machine size and the problem size proportionally; such a system is known as a scalable computer for solving scalable problems. The constant-workload efficiency curve is denoted α and corresponds to Amdahl's law.
If the workload grows linearly with the machine size (the ideal case), the efficiency curve, denoted γ, is almost flat.
If linear workload growth cannot be achieved, the second choice is sublinear scalability as close to linearity as possible, illustrated by curve β in the figure. The sublinear efficiency curve β lies somewhere between curves α and γ.
If the workload has an exponential growth pattern, the system is poorly scalable: to keep a constant efficiency or a good speedup, the increase in workload with problem size becomes explosive and exceeds the memory or I/O limits. Thus the efficiency curve θ is achievable only with exponentially increased memory capacity.
Application Models: Three speedup performance models are defined below; they are bounded by limited memory, limited tolerance for IPC latency, and limited I/O bandwidth.

Fixed-Load Model: It corresponds to a constant workload, with efficiency curve α. It is limited by the communication bound shown as the shaded area in the figure.
Fixed-Time Model: It corresponds to a linear workload, with efficiency curve γ.
Fixed-Memory Model: It corresponds to a workload lying between curves γ and θ. It is limited by the memory bound shown as the shaded area in the figure.

Scalability Analysis
Scalability analysis determines whether parallel processing of a given problem can offer the desired improvement in performance. The analysis should help guide the design of a massively parallel processor.
Some tradeoffs in scalability analysis:
1. Computer cost and programming overhead should be considered in scalability analysis.
2. A multi-user environment, in which multiple programs execute concurrently by sharing the available resources, should also be considered.

3.2.3 Scalability of Parallel Algorithms

Parallel algorithms are specially designed for parallel computers. The characteristics of parallel algorithms are listed below.
1. Deterministic versus non-deterministic: with a deterministic algorithm, for a given input the machine always produces the same output, whereas with a non-deterministic algorithm the same input may produce different outputs in different runs. Deterministic algorithms with polynomial time complexity are the ones implementable on real machines.
2. Computational granularity: granularity decides the size of the data items and program modules used in the computation. In this sense, we classify algorithms as fine-grain, medium-grain, or coarse-grain. In fine-grained parallelism, a program is broken down into a large number of small tasks. In coarse-grained parallelism, a program is split into large tasks. Medium-grained parallelism is a compromise between the two, with task sizes and communication times greater than in fine-grained parallelism and smaller than in coarse-grained parallelism. Most general-purpose parallel computers fall into this category.
3. Parallelism profile: the effectiveness of a parallel algorithm is determined by the distribution of its degree of parallelism.
4. Communication patterns and synchronization requirements: communication patterns address both memory access and interprocessor communication. The patterns can be static or dynamic. Static algorithms are more suitable for SIMD or pipelined machines, while dynamic algorithms are for MIMD machines. The synchronization frequency also affects the efficiency of an algorithm.
5. Uniformity of operations: the same (uniform) operation is performed on the data set, and the data is partitioned across parallel nodes for data-level parallelism. This is suitable for SIMD processing or pipelining.
6. Memory requirement and data structures: memory efficiency is affected by the data structures chosen and the data movement patterns in the algorithm. The efficiency of a parallel algorithm can be assessed by analyzing its space and time complexity.

Isoefficiency

Let w be the workload and s the problem size, so w = w(s). The machine size is n. Let h be the communication overhead of the parallel algorithm; it is a function of both the machine size n and the problem size s:

h = h(s, n)

The efficiency of a parallel algorithm implemented on a parallel computer is

E = w(s) / (w(s) + h(s, n))

A small isoefficiency function means that small increments in the problem size are sufficient for efficient utilization of an increasing number of processing elements, indicating that the parallel system is highly scalable. A large isoefficiency function indicates a poorly scalable parallel system.
In general, the overhead h increases with increasing values of n and s. The isoefficiency concept requires the efficiency to remain constant when a parallel algorithm is implemented on larger parallel computers.

To maintain a constant efficiency E, the workload w(s) must grow in proportion to the overhead h(s, n):

w(s) = (E / (1 − E)) · h(s, n)

The factor C = E / (1 − E) is constant for a fixed efficiency E. Thus we define the isoefficiency function, which is used for scalability analysis, as

f_E(n) = C · h(s, n)

The isoefficiency functions of common parallel algorithms are polynomial functions of n, i.e. they are O(n^k) for some k ≥ 1. The smaller the power of n, the more scalable the parallel system.
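A small sketch of how the isoefficiency function can be used, assuming a purely hypothetical overhead model h(s, n) = n·log2(n)·s:

# Sketch of an isoefficiency calculation under an assumed overhead model.
# For a target efficiency E, the workload must grow as w(s) = C * h(s, n)
# with C = E / (1 - E).
import math

def required_workload(s: float, n: int, E: float) -> float:
    C = E / (1.0 - E)
    h = n * math.log2(n) * s          # assumed overhead function h(s, n)
    return C * h                      # workload needed to hold efficiency at E

for n in (4, 16, 64, 256):
    print(n, required_workload(s=1.0, n=n, E=0.8))
# The faster this value grows with n, the less scalable the algorithm/machine pair.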

The isoefficiency functions of four matrix multiplication algorithms are given below.


3.3 Speedup Performance Laws
Three speedup performance models are described below
3.3.1 Amdahl’s law for fixed workload
Time-critical applications, where the goal is to minimize the turnaround time, provided the major motivation for the fixed-load speedup model and Amdahl's law.
Here we consider a fixed workload; as the number of processors in the parallel computer increases, the workload is distributed over more processors for parallel execution. Consider the case DOP = i with i ≥ n, where n is the number of processors (machine size). Assume the workload W_i is executed by the n processors, m is the maximum parallelism observed, and Δ is the computing capacity of each processor.
The execution time of W_i on n processors is

t_i(n) = (W_i / (i·Δ)) · ⌈i/n⌉

The response time is

T(n) = Σ_{i=1}^{m} (W_i / (i·Δ)) · ⌈i/n⌉

Now we define the fixed-load speedup factor S_n as the ratio of T(1) to T(n), where T(1) is the response time on a uniprocessor system:

T(1) = Σ_{i=1}^{m} W_i / Δ

S_n = T(1)/T(n) = (Σ_{i=1}^{m} W_i) / (Σ_{i=1}^{m} (W_i/i)·⌈i/n⌉)

Let Q(n) be the overhead contributed by interprocessor communication, memory access latencies, etc. We can then rewrite S_n as

S_n = (Σ_{i=1}^{m} W_i) / (Σ_{i=1}^{m} (W_i/i)·⌈i/n⌉ + Q(n))

Q(n) depends on the application as well as on the machine.

Consider a situation where the system can operate in only two modes, i.e. sequential mode with DOP = 1 and perfectly parallel mode with DOP = n. Then S_n reduces to

S_n = (W_1 + W_n) / (W_1 + W_n/n)

The execution time of the parallel portion of the program is reduced when it is executed by n processors, but the sequential portion does not change and requires a constant amount of time, as shown in the figure below. With the workload normalized,

W_1 + W_n = α + (1 − α) = 1

so that S_n = n / (1 + (n − 1)·α). Here α represents the fraction of the program that must be executed sequentially, and 1 − α corresponds to the portion of the code that can be executed in parallel. The total workload W_1 + W_n is kept constant.
The speedup curve falls very rapidly as α increases. This means that even with a small percentage of sequential code the overall speedup cannot exceed 1/α. This α is the sequential bottleneck of the program, which cannot be removed by increasing the number of processors.
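The sketch below (hypothetical workload profile W_i and overhead Q(n)) evaluates the fixed-load speedup expression above for several machine sizes:

# Sketch: fixed-load speedup S_n for an assumed workload profile {W_i} and
# overhead Q(n), following S_n = (Σ W_i) / (Σ (W_i/i)·ceil(i/n) + Q(n)).
from math import ceil

W = {1: 10.0, 4: 40.0, 16: 160.0}     # assumed workload W_i per DOP value i

def fixed_load_speedup(n: int, Q: float = 0.0) -> float:
    T1 = sum(W.values())                                   # time on 1 processor (Δ = 1)
    Tn = sum(w / i * ceil(i / n) for i, w in W.items())    # time on n processors
    return T1 / (Tn + Q)

for n in (1, 2, 4, 8, 16):
    print(n, round(fixed_load_speedup(n, Q=2.0), 2))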

3.3.2 Gustafson’s Law for Scaled Problems


Here the workload is not constant. The problem size is increased as the machine size is upgraded, so that the additional computing power is used to obtain a more accurate solution while the execution time is kept unchanged.
The main advantage is not in saving time but in producing a much more accurate solution. This problem scaling for accuracy motivated Gustafson to develop the fixed-time speedup model. The scaled problem keeps all of the increased resources busy, resulting in a better system utilization ratio.

Fixed-Time Speedup

Let m′ be the maximum DOP with respect to the scaled problem and W_i′ the scaled workload with DOP = i.

The fixed-time speedup is defined under the assumption that T(1) = T′(n), i.e. the scaled problem takes the same time on n processors as the original problem takes on one. The general formula for fixed-time speedup is

S_n′ = (Σ_{i=1}^{m′} W_i′) / (Σ_{i=1}^{m} W_i)

where, from the fixed-time condition, Σ_{i=1}^{m} W_i = Σ_{i=1}^{m′} (W_i′/i)·⌈i/n⌉ + Q(n).

Gustafson’s Law

In the special two-mode case, with the workload normalized so that W_1 + W_n = α + (1 − α) = 1 and the parallel part scaled to n·W_n, Gustafson’s law can be restated in terms of α as

S_n′ = (W_1 + n·W_n) / (W_1 + W_n) = α + n·(1 − α)

As shown in the figure below, the workload increases with the number of processors while the execution time remains unchanged. Note that the slope of the S_n′ curve is much flatter than in the fixed-load case. This implies that Gustafson’s law does support scalable performance as the machine size increases: the idea is to keep all processors busy by increasing the problem size. When the problem can scale to match the available computing power, the sequential fraction is no longer a bottleneck.
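A short sketch of Gustafson’s fixed-time speedup for assumed values of α, to contrast with the Amdahl sketch earlier:

# Sketch of Gustafson's fixed-time speedup S'_n = alpha + n*(1 - alpha)
# for assumed sequential fractions alpha.

def gustafson_speedup(n: int, alpha: float) -> float:
    return alpha + n * (1.0 - alpha)

for alpha in (0.01, 0.1, 0.5):
    print(alpha, [round(gustafson_speedup(n, alpha), 1) for n in (8, 64, 512)])
# Unlike Amdahl's curve, S'_n keeps growing almost linearly in n because the
# parallel part of the workload is scaled up with the machine size.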
3.3.3 Memory-Bounded Speedup Model
Some scientific and engineering applications require a large memory space. The idea here is to solve the largest problem that fits in the available memory capacity. This model may require an increase in execution time to achieve scalable performance. The fixed-memory model assumes a scaled workload and allows the execution time to increase; the increase in workload (problem size) is memory-bound.
Fixed-Memory Speedup
Let M be the memory requirement of a given problem and W the computational workload: W = g(M), or equivalently M = g⁻¹(W).

Consider that the system operates in two modes, i.e. sequential and perfectly parallel mode. The memory-bounded speedup is then given by

S_n* = (W_1 + G(n)·W_n) / (W_1 + G(n)·W_n/n)

Here W_n* = G(n)·W_n is the scaled workload, with W_n* = g*(nM), where nM is the increased memory capacity of an n-node multicomputer. The factor G(n) reflects the increase in workload as the memory increases n times:

g*(nM) = G(n)·g(M) = G(n)·W_n

The above speedup equation covers the following three cases.
Case 1: G(n) = 1. This corresponds to the case where the problem size is fixed.
Case 2: G(n) = n. This applies to the case where the workload increases n times when the memory is increased n times.
Case 3: G(n) > n. This corresponds to the situation where the computational workload increases faster than the memory requirement.
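The sketch below (with an assumed normalized workload split W_1, W_n) compares the three cases of G(n) in the two-mode memory-bounded speedup formula above:

# Sketch of the memory-bounded (fixed-memory) speedup in the two-mode case,
#   S*_n = (W_1 + G(n)*W_n) / (W_1 + G(n)*W_n / n),
# comparing the three cases of G(n) discussed above (values are illustrative).

def memory_bounded_speedup(n: int, W1: float, Wn: float, G: float) -> float:
    return (W1 + G * Wn) / (W1 + G * Wn / n)

n, W1, Wn = 64, 0.1, 0.9          # assumed normalized workload split
for label, G in (("G(n)=1", 1), ("G(n)=n", n), ("G(n)>n", 2 * n)):
    print(label, round(memory_bounded_speedup(n, W1, Wn, G), 2))
# Larger G(n) pushes the speedup closer to n, at the cost of a longer execution time.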

3.4 Scalability Analysis and Approaches


The simplest definition of scalability is that the performance of a computer system should increase linearly with the number of processors used for a given application. Scalability analysis of a given computer system must be conducted for a given application program or algorithm. The analysis can be performed under different constraints on the growth of the problem size (workload) and of the machine size (number of processors).

3.4.1 Scalability Metrics and Goals


Identified below are the basic metrics or factors affecting the scalability of a computer system for a given application:
1. Machine size (n): the number of processors in the parallel computer system. A larger machine size implies more computing power and more resources.
2. Clock rate (f): the clock rate is the reciprocal of the clock cycle time. The clock rate should scale with improvements in technology.
3. CPU time (T): the execution time, denoted T(s, n), where s is the problem size and n is the number of processors in the parallel machine.
4. I/O demand (d): the I/O requests for moving data, programs and results can overlap with CPU operations in a multiprogrammed environment.
5. Memory capacity (m): the memory requirement changes during program execution depending on the problem size, the algorithm, etc. Physical memory is limited, but virtual memory is almost unlimited.
6. Communication overhead (h): the latency due to interprocessor communication, synchronization and remote memory access, denoted h(s, n), where s is the problem size and n is the number of processors.
7. Computer cost (c): the total cost of hardware and software resources.
8. Programming overhead (p): programming overhead may slow down software productivity and thus implies a high cost.

Speedup and Efficiency Revisited

T(s, 1) is the sequential execution time on a uniprocessor system and T(s, n) the execution time on a parallel machine with n processors. The asymptotic speedup S(s, n) is defined as

S(s, n) = T(s, 1) / (T(s, n) + h(s, n))

where h(s, n) is the communication overhead plus the I/O overhead. T(s, n) can be reduced by increasing the number of processors.
The system efficiency is given by

E(s, n) = S(s, n) / n

The best possible efficiency is one, implying that the best speedup is linear, i.e. S(s, n) = n. A system is scalable if the system efficiency E(s, n) = 1 for all algorithms, for any number of processors n and any problem size s.
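A minimal sketch, with assumed timing and overhead functions, of S(s, n) and E(s, n) as defined above:

# Sketch: speedup and efficiency with an explicit overhead term, using the
# definitions above and hypothetical timing/overhead values for one problem size s.

def speedup(T1: float, Tn: float, h: float) -> float:
    return T1 / (Tn + h)                   # S(s,n) = T(s,1) / (T(s,n) + h(s,n))

def efficiency(T1: float, Tn: float, h: float, n: int) -> float:
    return speedup(T1, Tn, h) / n          # E(s,n) = S(s,n) / n

T1 = 1000.0                                # assumed sequential time T(s,1)
for n in (8, 32, 128):
    Tn = T1 / n                            # assumed ideal parallel compute time
    h  = 5.0 * n                           # assumed overhead h(s,n) growing with n
    print(n, round(speedup(T1, Tn, h), 2), round(efficiency(T1, Tn, h, n), 3))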

2.3 Program Flow Mechanisms


There are three program flow mechanisms, described below.
Control Flow versus Data Flow
In control flow computers, instructions are executed in the order stated explicitly in the program. The program counter (PC) keeps track of the next instruction to be executed, and the program flow is controlled explicitly by the programmer.

In data flow computers, an instruction is executed when its data (operands) become available; data tokens are passed directly between instructions. The data generated by an instruction is duplicated into multiple copies and forwarded to all instructions that need it. A data token, once consumed by an instruction, is no longer available for reuse. A data flow computer requires no PC or control sequencer, but it needs mechanisms to detect data (operand) availability and to match data tokens with the instructions that need them, as in the sketch below.
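The toy sketch below is only an illustration of the data-driven firing rule (it is not the MIT design): an instruction fires as soon as all of its operand tokens have arrived, and its result token is duplicated to every consumer.

# Toy data-driven execution sketch: instructions fire when all operand tokens
# have arrived; results are copied to every consumer instruction.
from dataclasses import dataclass, field

@dataclass
class Instr:
    op: str                                          # "+" or "*"
    need: int                                        # number of operands required
    consumers: list = field(default_factory=list)    # downstream instruction names
    tokens: list = field(default_factory=list)       # operand tokens received so far

def run(instrs: dict, initial_tokens: list):
    ready = list(initial_tokens)                     # (target instruction, value) pairs
    while ready:
        name, value = ready.pop()
        ins = instrs[name]
        ins.tokens.append(value)
        if len(ins.tokens) == ins.need:              # firing rule: all operands present
            result = {"+": sum, "*": lambda v: v[0] * v[1]}[ins.op](ins.tokens)
            for c in ins.consumers:                  # duplicate the result token
                ready.append((c, result))
            if not ins.consumers:
                print(name, "=", result)

# a = (b + 1) * c  with b = 4, c = 3
instrs = {"add": Instr("+", 2, ["mul"]), "mul": Instr("*", 2, [])}
run(instrs, [("add", 4), ("add", 1), ("mul", 3)])    # prints: mul = 15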
The architecture of the MIT tagged-token dataflow computer is shown below.

There are n processing elements (PEs) interconnected by an n × n routing network. Within each PE, a low-level token-matching mechanism dispatches only those instructions whose data (operands) are available in the program memory. Each datum is tagged with the address of the instruction to which it belongs and the context in which the instruction is being executed. Tokens can also be passed to other PEs through the routing network. All internal token circulation operations are pipelined without blocking. Another synchronization mechanism, called the I-structure, is provided within each PE. The I-structure is a tagged memory unit for overlapped usage of a data structure by both the producer and consumer processes.

Demand Driven Mechanisms

Demand-driven computation corresponds to lazy evaluation, because operations are executed only when their results are required by another instruction. Consider the following expression:

a = ((b + 1) * c - (d / e))

Two reduction models for demand-driven computation are given below.
In a string reduction model, each demander gets a separate copy of the expression for its own evaluation. A long string expression is reduced to a single value in a recursive fashion. An expression is said to be fully reduced when all of its arguments have been replaced by literal values.
In a graph reduction model, the expression is represented as a directed graph. The graph is reduced by evaluating branches or subgraphs. Different parts of the graph (subgraphs) can be reduced or evaluated in parallel upon demand.
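A minimal sketch of demand-driven (lazy) evaluation of the expression above, using Python thunks with assumed literal values for b, c, d and e; each subexpression is evaluated only when its result is demanded:

# Lazy (demand-driven) evaluation sketch of a = ((b + 1) * c - (d / e)).
b, c, d, e = 4, 3, 10, 2      # assumed literal values

plus  = lambda: b + 1          # nothing is computed yet; only a demand triggers work
times = lambda: plus() * c
div   = lambda: d / e
a     = lambda: times() - div()

print(a())    # demanding a forces the whole reduction: ((4 + 1) * 3) - (10 / 2) = 10.0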

2.3.3 Comparison of Flow Mechanisms
