
Performance Analysis (CS-3006)

Dr. Muhammad Mateen Yaqoob
Department of AI & DS
National University of Computer & Emerging Sciences, Islamabad Campus
Performance?
• To measure improvement in computer architectures, it is
  necessary to compare alternative designs

• A better system has better performance, but what exactly
  is performance?
Performance?
Performance Metrics – Sequential Systems
Performance?
• For computer systems and programs:
  – one main performance metric is time
  – or simply wall-clock time
Performance?
The execution time of a program A can be split into:
• User CPU time: the time the CPU spends executing A
• System CPU time: the time the CPU spends executing
  operating-system routines invoked by A
• Waiting time: caused by waiting for the completion of I/O
  operations and by the execution of other programs because
  of time sharing

Here we concentrate on user CPU time
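As an aside, a minimal Python sketch (not from the slides) of how these times can be observed in practice; note that time.process_time() reports the combined user + system CPU time of the current process, not user CPU time alone:

```python
import time

def busy_work(n):
    """A purely CPU-bound loop, standing in for program A."""
    total = 0
    for i in range(n):
        total += i * i
    return total

wall_start = time.perf_counter()   # wall-clock time (includes any waiting)
cpu_start = time.process_time()    # user + system CPU time of this process

busy_work(5_000_000)

print(f"wall-clock time: {time.perf_counter() - wall_start:.3f} s")
print(f"CPU time:        {time.process_time() - cpu_start:.3f} s")
```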


Computer Performance
• Measuring Computer Performance
– Clock Speed
– MIPS
– FLOPS
– Benchmark Tests

• Factors affecting Computer Performance
– Processor Speed
– Data-bus width
– Amount of cache
– Faster interfaces
– Amount of main-memory
Measuring Performance
• Every processor has a clock which ticks continuously
  at a regular rate

• The clock synchronises all the digital components

• Clock rate is measured in MHz or GHz; cycle time in ns or ps

• 200 MHz (megahertz) means the clock ticks 200,000,000
  (200 million) times a second (Pentium, 1995)
Machine Clock Rate
• Clock Rate (CR), in MHz, GHz, etc., is the inverse of the Clock
  Cycle (CC) time, i.e. the duration of a single clock period:

      CC = 1 / CR

10 ns clock cycle   →  100 MHz clock rate
5 ns clock cycle    →  200 MHz clock rate
2 ns clock cycle    →  500 MHz clock rate
1 ns clock cycle    →    1 GHz clock rate
500 ps clock cycle  →    2 GHz clock rate
250 ps clock cycle  →    4 GHz clock rate
200 ps clock cycle  →    5 GHz clock rate
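A minimal Python sketch (not part of the original slides) that applies CC = 1 / CR to reproduce the conversions above:

```python
# CC = 1 / CR: convert a clock cycle time into the corresponding clock rate.
def clock_rate_hz(cycle_time_seconds):
    return 1.0 / cycle_time_seconds

for cycle_ns in [10, 5, 2, 1, 0.5, 0.25, 0.2]:
    rate_ghz = clock_rate_hz(cycle_ns * 1e-9) / 1e9
    print(f"{cycle_ns:>5} ns cycle -> {rate_ghz:g} GHz")
```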
Measuring Performance
• Clock Speed
– Generally the faster the clock speed the faster the
processor – 3.2 GHz is faster than 1.2 GHz

• MIPS – Millions of Instructions per Second
  – A better basis for comparison
  – But beware of misleading claims, such as ratings based only
    on the simplest and fastest instructions, or comparisons
    across processor families with different ISAs
Measuring Performance
• Flops – Floating Point Operations per sec.
– Best measure, as floating-point operations are the same on
  every processor and so provide the best basis for comparison
– A measure of theoretical peak performance
Measuring Performance
• Flops – Floating Point Operations per sec.
– Servers are the only computers that sometimes have
more than one socket; for most home computers
(desktop or laptop), “sockets” will be 1.
– Cores per socket depend on your CPU. It could be 2
(dual-core), 3, 4 (quad-core), 6 (hexacore), or 8. There
are some prototype CPUs with as many as 80 cores.
– “Clock cycles per second” refers to the speed of your
CPU. Most modern CPUs are rated in gigahertz. So 2
GHz would be 2,000,000,000 clock cycles per second.
– The number of FLOPs per cycle also depends on the
CPU. One of the fastest (home computer) CPUs is the
Intel Core i7–970, capable of 4 double-precision or 8
single-precision floating-point operations per cycle.
Measuring Performance
• Test Example
– Intel Core i7–970 has 6 cores. If it is running at 3.46 GHz
  and can perform 8 floating-point operations per cycle,
  calculate the theoretical compute power of this machine.

– The formula is:
– 1 (socket) * 6 (cores) * 3,460,000,000 (cycles per second)
  * 8 (single-precision FLOPs per cycle) = 166,080,000,000
  single-precision FLOPs per second, or 83,040,000,000
  double-precision FLOPs per second (at 4 FLOPs per cycle).
– ≈ 166 GFLOPS single precision (≈ 83 GFLOPS double precision).
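A small Python sketch (not from the slides) of the same peak-FLOPS arithmetic; the socket, core, clock and FLOPs-per-cycle figures are simply the example values above:

```python
# Theoretical peak = sockets * cores_per_socket * clock_Hz * FLOPs_per_cycle.
def peak_flops(sockets, cores_per_socket, clock_hz, flops_per_cycle):
    return sockets * cores_per_socket * clock_hz * flops_per_cycle

single = peak_flops(1, 6, 3.46e9, 8)  # 8 single-precision FLOPs per cycle
double = peak_flops(1, 6, 3.46e9, 4)  # 4 double-precision FLOPs per cycle

print(f"single precision: {single / 1e9:.2f} GFLOPS")  # ~166.08 GFLOPS
print(f"double precision: {double / 1e9:.2f} GFLOPS")  # ~83.04 GFLOPS
```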
Units of High Performance Computing

Basic Unit | Speed                     | Capacity
Kilo       | 1 Kflop/s = 10^3 Flop/s   | 1 KB = 10^3 Bytes
Mega       | 1 Mflop/s = 10^6 Flop/s   | 1 MB = 10^6 Bytes
Giga       | 1 Gflop/s = 10^9 Flop/s   | 1 GB = 10^9 Bytes
Tera       | 1 Tflop/s = 10^12 Flop/s  | 1 TB = 10^12 Bytes
Peta       | 1 Pflop/s = 10^15 Flop/s  | 1 PB = 10^15 Bytes
Exa        | 1 Eflop/s = 10^18 Flop/s  | 1 EB = 10^18 Bytes
Zetta      | 1 Zflop/s = 10^21 Flop/s  | 1 ZB = 10^21 Bytes
Measuring Performance
• When we measure performance we usually mean how
fast the computer carries out instructions

• The measure we use is MIPS (Millions of Instructions
  per Second)

• MIPS is affected by:
  – The clock speed of the processor
  – The speed of the buses
  – The speed of memory access
MIPS

Example:
n_instr(A) = 4 million instructions
T_U_CPU(A) = 0.05 seconds

MIPS(A) = n_instr(A) / (T_U_CPU(A) * 10^6)
        = (4 * 10^6) / (0.05 * 10^6) = 80 MIPS


MIPS

Example:
r_cycle = 600 MHz (Mega = 10^6)
CPI(A) = 3

MIPS(A) = r_cycle / (CPI(A) * 10^6)
        = (600 * 10^6) / (3 * 10^6) = 200 MIPS


MFLOPs

Example:
n_flp_op(A) = 90 million floating-point operations
T_U_CPU(A) = 3.5 seconds

MFLOPS(A) = n_flp_op(A) / (T_U_CPU(A) * 10^6)
          = (90 * 10^6) / (3.5 * 10^6) ≈ 25.71 MFLOPS
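The three rate metrics above can be collected into a short Python sketch (illustrative only; the function names are my own):

```python
# MIPS and MFLOPS exactly as defined in the examples above.
def mips_from_counts(n_instr, cpu_time_s):
    return n_instr / (cpu_time_s * 1e6)

def mips_from_cpi(clock_rate_hz, cpi):
    return clock_rate_hz / (cpi * 1e6)

def mflops(n_fp_ops, cpu_time_s):
    return n_fp_ops / (cpu_time_s * 1e6)

print(mips_from_counts(4e6, 0.05))   # 80.0 MIPS
print(mips_from_cpi(600e6, 3))       # 200.0 MIPS
print(round(mflops(90e6, 3.5), 2))   # 25.71 MFLOPS
```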


Benchmarks
Why Do Benchmarks?
• How do we evaluate differences?
  – Between different systems
  – From changes to a single system

• Benchmarks represent a large class of important programs
Benchmarks
• Microbenchmarks
– Measure one performance dimension or aspect
• Cache bandwidth
• Memory bandwidth
• Procedure call overhead
• FP performance
– Insight into the underlying performance factors
– Not a good predictor of overall application performance

• Macrobenchmarks
– Application execution time
• Measures overall performance, using one application
• Need application suite
Popular Benchmark Suites
• Desktop
– SPEC CPU2000 - CPU intensive, integer & floating-point applications
– SPECviewperf, SPECapc - Graphics benchmarks
– SysMark, Winstone, Winbench
• Embedded
– EEMBC - Collection of kernels from 6 application areas
– Dhrystone - Old synthetic benchmark
• Servers
– SPECweb, SPECfs
– TPC-C - Transaction processing system
– TPC-H, TPC-R - Decision support system
– TPC-W - Transactional web benchmark
• Parallel Computers
– SPLASH - Scientific applications & kernels
– Linpack
Limitations of Memory System Performance
Limitations of Memory System Performance
• Example
• Consider a processor operating at 1 GHz (1 ns clock) connected to a
DRAM with a latency of 100 ns (no caches). Assume that the processor
has two multiply-add units and can execute four instructions in each
cycle of 1 ns.
• The peak processor rating = 4 GFLOPS. Since the memory latency is
equal to 100 cycles (each cycle is 1 ns) and block size is one word, every
time a memory request is made, the processor must wait 100 cycles
before it can process the data.
• Consider the problem of computing the dot-product of two vectors on
such a platform. A dot-product computation performs one multiply-add
on a single pair of vector elements, i.e., each floating point operation
requires one data fetch.
• It is easy to see that the peak speed of this computation is limited to one
floating point operation every 100 ns, or a speed of 10 MFLOPS.
Impact of Cache on System Performance
• Example
• Consider a processor with a 1 GHz clock (1 ns cycle) and a 100 ns
  latency DRAM, now with a 32 KB cache (1 ns latency) added, used to
  multiply matrices.
• The cache stores two matrices A and B of dimensions 32 x 32. Assume
  an ideal cache placement strategy. Fetching the two matrices
  (2K words) takes approximately 200 µs.
• Multiplying two n x n matrices takes 2n^3 operations. Therefore,
  2 * 32^3 = 64K operations, done in 16K cycles or 16 µs (four
  operations per cycle).
• Total time = 200 µs + 16 µs = 216 µs, i.e. 64K operations in
  216 µs, or about 303 MFLOPS.
• This is roughly a 30x improvement over the previous example, but
  still only about 10% of the peak performance (4 GFLOPS).
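A back-of-the-envelope Python sketch (not from the slides) that reproduces the rough numbers of the two memory examples above, using the slides' approximations:

```python
# Example 1: no cache, 100 ns latency, one FLOP per fetched operand.
flops_no_cache = 1 / 100e-9          # one FLOP every 100 ns
print(f"no cache:   {flops_no_cache / 1e6:.0f} MFLOPS")   # ~10 MFLOPS

# Example 2: 32 x 32 matrix multiply with a 32 KB cache.
fetch_time = 200e-6                  # ~200 us to fetch 2K words of A and B
ops = 2 * 32**3                      # 2n^3 = 64K operations
compute_time = ops / 4 * 1e-9        # 4 operations per 1 ns cycle -> ~16 us
rate = ops / (fetch_time + compute_time)
print(f"with cache: {rate / 1e6:.0f} MFLOPS")             # ~303 MFLOPS
```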
Performance Metrics – Parallel Systems
Amdahl's Law & Speedup Factor
Amdahl's Law
Amdahl's Law states that potential program
speedup is defined by the fraction of code (P)
that can be parallelized:

                      1
    Max. speedup = -------
                    1 - P

• If none of the code can be parallelized, P = 0 and the speedup = 1
  (no speedup). If all of the code is parallelized, P = 1 and the
  speedup is infinite (in theory).

• If 50% of the code can be parallelized, the maximum speedup = 2,
  meaning the code will run twice as fast.
Amdahl's Law
• It soon becomes obvious that there are limits to the
scalability of parallelism

• For example, at P = .50, .90 and .99 (50%, 90% and 99% of
  the code is parallelizable):

                     speedup
        --------------------------------
        N       P = .50   P = .90   P = .99
        -----   -------   -------   -------
           10      1.82      5.26      9.17
          100      1.98      9.17     50.25
         1000      1.99      9.91     90.99
        10000      1.99      9.91     99.02
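The table can be reproduced with a few lines of Python (a sketch, using speedup = 1 / ((1 - P) + P/N)):

```python
# Amdahl's law with N processors and parallel fraction P.
def amdahl_speedup(p_parallel, n_procs):
    return 1.0 / ((1.0 - p_parallel) + p_parallel / n_procs)

print("    N   P=.50   P=.90   P=.99")
for n in (10, 100, 1000, 10000):
    row = [amdahl_speedup(p, n) for p in (0.50, 0.90, 0.99)]
    print(f"{n:>5}  {row[0]:6.2f}  {row[1]:6.2f}  {row[2]:6.2f}")
```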
Amdahl's Law for Parallel Program
• Example
• If 30% of the execution time may be the subject of a
  speedup, p will be 0.3; if the improvement makes the
  affected part twice as fast, s will be 2. According to
  Amdahl's law, what will the overall speedup be?
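(Using Amdahl's law in the form speedup = 1 / ((1 - p) + p/s):
 1 / (0.7 + 0.3/2) = 1 / 0.85 ≈ 1.18.)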
Amdahl's Law for Parallel Program
• Example
• Assume that we are given a serial task that is split into four
consecutive parts, whose percentages of execution time are
p1=0.11, p2=0.18, p3=0.23, and p4=0.48 respectively. Then we are
told that the 1st part is not sped up, so s1=1, while the 2nd part is
sped up 5 times, so s2=5, the 3rd part is sped up 20 times, so s3=20,
and the 4th part is sped up 1.6 times, so s4=1.6. By using Amdahl's
law, the overall speedup is?
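Both example questions can be answered with a small Python sketch of the generalised Amdahl formula, overall speedup = 1 / sum(p_i / s_i); the helper name is my own:

```python
# Generalised Amdahl's law: parts is a list of (fraction, speedup) pairs,
# where the fractions sum to 1.
def overall_speedup(parts):
    return 1.0 / sum(p / s for p, s in parts)

# Example 1: 30% of the time sped up 2x, the remaining 70% unchanged.
print(round(overall_speedup([(0.7, 1), (0.3, 2)]), 3))    # ~1.176

# Example 2: four parts with individual speedups.
parts = [(0.11, 1), (0.18, 5), (0.23, 20), (0.48, 1.6)]
print(round(overall_speedup(parts), 3))                   # ~2.186
```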
Amdahl's Law
Maximum Speedup (Amdahl's Law)

Maximum speedup = 1 / f, where f = serial fraction
E.g., f = 0.05 (5% serial): maximum speedup = 1 / 0.05 = 20
Maximum Speedup (Amdahl's Law)
Maximum speedup is usually p with p processors
(linear speedup).

It is possible to get super-linear speedup (greater than p),
but usually for a specific reason such as:
• Extra memory in the multiprocessor system
• A nondeterministic algorithm
Speedup

    S(p) = ts / tp

where ts is the execution time on a single processor and tp is
the execution time on a multiprocessor.

• S(p) gives the increase in speed from using the multiprocessor

• For ts, use the best sequential algorithm on a single-processor
  system, rather than the parallel program run with 1 processor.
  The underlying algorithm of the parallel implementation might
  be (and usually is) different.
Speedup
Speedup can also be expressed in terms of computational steps:

    S(p) = (number of computational steps using one processor) /
           (number of parallel computational steps with p processors)
Speedup

    S(p) = p / (1 + (p - 1) f)

Here f is the fraction of the code that is serial:

e.g. if f = 1 (all of the code is serial), the speedup will be 1
no matter how many processors are used.
Speedup (with N CPUs or Machines)
• Introducing the number of processors performing the
parallel fraction of work, the relationship can be
modelled by:
                       1
    speedup = ------------------
               fS  +  fP / Proc

• where fP = parallel fraction, Proc = number of processors,
  and fS = serial fraction
Linear and Superlinear Speedup
• Linear speedup
– Speedup of N, for N processors
– Parallel program is perfectly scalable
– Rarely achieved in practice

• Superlinear Speedup
– Speedup of >N, for N processors
• Theoretically not possible
• How is this achievable on real machines?
– Think about physical resources (cache, memory
etc) of N processors
Super-linear Speedup
Super-linear Speedup Example - Searching
Efficiency
• Efficiency is the ability to avoid wasting materials,
energy, efforts, money, and time in doing something or
in producing a desired result

• The ability to do things well, successfully, and without waste
Efficiency
[Table: speedups (S) and efficiencies (E) of a parallel program for
different problem sizes and machine sizes (numbers of processors)]
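For reference, the standard definition used here is E(p) = S(p) / p; a minimal Python sketch (my own, not from the slides):

```python
# Parallel efficiency: E(p) = S(p) / p.
def efficiency(speedup, n_procs):
    return speedup / n_procs

# E.g., a speedup of 6.4 on 8 processors gives 80% efficiency.
print(f"{efficiency(6.4, 8):.0%}")   # 80%
```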
Gustafson’s Law
Is Amdahl’s Law Sufficient?
• Amdahl’s law works on a fixed problem size
– Shows how execution time decreases as number of
processors increases
– Limits maximum speedup achievable
– So, does it mean large parallel machines are not
useful?
– Ignores performance overhead (e.g. communication,
load imbalance)

• Gustafson’s Law says that increasing the problem size on
  large machines can retain scalability with respect to the
  number of processors
Gustafson’s Law
• Time-constrained scaling (i.e., we have fixed-time
to do performance analysis or execution)
• Example: a user wants more accurate results
within a time limit
• Execution time is fixed as system scales
Amdahl versus Gustafson's Law
[Figure slides comparing Amdahl's and Gustafson's laws]
Credits: Introduction to Parallel Computing, University of Oregon, IPCC
Gustafson’s Law

• With p processors: as the number of processors increases, the
  problem size is also increased
  – Importantly, it is the parallel part that grows
Gustafson’s Law

    S(p) = p + (1 - p) * s

where S(p) = scaled speedup using p processors, and
s = fraction of the program that is serial (cannot be parallelized)
Gustafson’s Law- Example

With p = 10 processors and serial fraction s = 0.03:

S(p) = 10 + (1 - 10) * 0.03 = 10 - 0.27 = 9.73   (scaled speedup)

Speedup using Amdahl's Law?


Gustafson’s Law- Example
Speedup using Amdahl's Law:

S(p) = 1 / (0.03 + 0.97/10)
     = 1 / (0.03 + 0.097)
     = 1 / 0.127
     = 7.874

OR

S(p) = 10 / (1 + (10 - 1) * 0.03)
     = 10 / (1 + 0.27)
     = 10 / 1.27
     = 7.874
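A short Python sketch (my own) comparing the two laws for this example (serial fraction s = 0.03, p = 10 processors):

```python
# Scaled (Gustafson) vs fixed-size (Amdahl) speedup for serial fraction s.
def gustafson_speedup(p, s):
    return p + (1 - p) * s           # = p - (p - 1) * s

def amdahl_speedup(p, s):
    return 1.0 / (s + (1 - s) / p)   # equivalently p / (1 + (p - 1) * s)

print(round(gustafson_speedup(10, 0.03), 2))   # 9.73
print(round(amdahl_speedup(10, 0.03), 3))      # 7.874
```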
Summary Gustafson’s Law
• Derived by fixing the parallel execution time (Amdahl
fixes the problem size -> fixed serial execution time)

• For many practical situations, Gustafson’s law makes more sense

• Have a bigger computer, solve a bigger problem
Scalability
• In general, a problem is scalable if it can handle ever
increasing problem sizes

• If we can increase the number of processes/threads and keep
  the efficiency fixed without increasing the problem size, the
  problem is strongly scalable.

• If we can keep the efficiency fixed by increasing the problem
  size at the same rate as we increase the number of
  processes/threads, the problem is weakly scalable.
Any Questions?
