CS-3006 4: Performance Analysis
Example:
n_instr(A) = 4 million instructions
TU_CPU(A) = 0.05 seconds
Example:
r_cycle = 600 MHz (Mega = 10^6)
CPI(A) = 3
Example:
n_flp_op(A) = 90 million floating-point operations
TU_CPU(A) = 3.5 seconds
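These three fragments supply inputs to the standard single-processor metrics; the questions themselves are not shown in this extract, so the sketch below assumes the usual ones: the MIPS rate from instruction count and CPU time, the MIPS rate from clock rate and CPI, and the MFLOPS rate. Function names are illustrative.

# Assumed formulas (standard definitions, not shown on the slide):
#   MIPS   = n_instr / (TU_CPU * 10^6)
#   MIPS   = r_cycle / (CPI * 10^6)
#   MFLOPS = n_flp_op / (TU_CPU * 10^6)

def mips_from_time(n_instr, t_cpu_s):
    # million instructions per second from instruction count and CPU time
    return n_instr / (t_cpu_s * 1e6)

def mips_from_clock(r_cycle_hz, cpi):
    # MIPS from clock rate (Hz) and average cycles per instruction
    return r_cycle_hz / (cpi * 1e6)

def mflops(n_flp_op, t_cpu_s):
    # million floating-point operations per second
    return n_flp_op / (t_cpu_s * 1e6)

print(mips_from_time(4e6, 0.05))   # 80.0 MIPS    (example 1)
print(mips_from_clock(600e6, 3))   # 200.0 MIPS   (example 2)
print(mflops(90e6, 3.5))           # ~25.7 MFLOPS (example 3)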
• Macrobenchmarks
– Application execution time
• Measures overall performance using a single application
• A suite of applications is needed for representative results
Popular Benchmark Suites
• Desktop
– SPEC CPU2000 - CPU-intensive integer & floating-point applications
– SPECviewperf, SPECapc - Graphics benchmarks
– SysMark, Winstone, Winbench
• Embedded
– EEMBC - Collection of kernels from 6 application areas
– Dhrystone - Old synthetic benchmark
• Servers
– SPECweb, SPECfs
– TPC-C - Transaction processing system
– TPC-H, TPC-R - Decision support system
– TPC-W - Transactional web benchmark
• Parallel Computers
– SPLASH - Scientific applications & kernels
– Linpack
Limitations of Memory System Performance
• Example
• Consider a processor operating at 1 GHz (1 ns clock) connected to a
DRAM with a latency of 100 ns (no caches). Assume that the processor
has two multiply-add units and can execute four instructions in each
cycle of 1 ns.
• The peak processor rating is 4 GFLOPS (two multiply-add units, each
delivering 2 FLOPs per 1 ns cycle). Since the memory latency equals
100 cycles (each cycle is 1 ns) and the block size is one word, every
time a memory request is made, the processor must wait 100 cycles
before it can process the data.
• Consider the problem of computing the dot-product of two vectors on
such a platform. A dot-product computation performs one multiply-add
on a single pair of vector elements, i.e., each floating point operation
requires one data fetch.
• It is easy to see that the peak speed of this computation is limited to one
floating point operation every 100 ns, or a speed of 10 MFLOPS.
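A back-of-the-envelope sketch of this bound, using only the numbers given above:

# Dot-product on the example machine (no cache): every FLOP needs one
# word from DRAM, so every FLOP pays the full 100 ns memory latency.
mem_latency_s = 100e-9              # DRAM latency per word
flops_per_fetch = 1                 # one FLOP per fetched word
effective_rate = flops_per_fetch / mem_latency_s
print(effective_rate / 1e6)         # 10.0 MFLOPS

peak_rate = 4e9                     # 4 GFLOPS peak rating
print(effective_rate / peak_rate)   # 0.0025 -> 0.25% of peak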
Impact of Cache on System Performance
• Example
• Consider the same processor with a 1 GHz clock (1 ns cycle) and 100 ns
latency DRAM, now augmented with a 32 KB cache of 1 ns latency, and
used to multiply two matrices.
• The cache stores two matrices A and B of dimensions 32 x 32. Assume an
ideal cache placement strategy. Fetching the two matrices
(corresponding to 2K words) takes approximately 200 µs (2048 words x 100 ns).
• Multiplying two n x n matrices takes 2n^3 operations. Therefore,
2 x 32^3 = 64K operations, executed in 16K cycles or 16 µs (four
instructions per cycle).
• Total computation time = 200 µs + 16 µs = 216 µs, i.e., 64K operations
in 216 µs, or roughly 303 MFLOPS.
• This is a 30x improvement over the previous example. However, it is
still only about 7.5% of the peak performance (4 GFLOPS).
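The same arithmetic as a sketch (exact values, before the slide's rounding):

# Cache example: fetch cost + compute cost for a 32 x 32 matrix multiply.
n = 32
words = 2 * n * n                   # two input matrices = 2K words
fetch_time = words * 100e-9         # 100 ns per word -> ~204.8 us
ops = 2 * n**3                      # 2n^3 FLOPs = 64K operations
compute_time = (ops / 4) * 1e-9     # 4 instructions per 1 ns cycle -> ~16.4 us
print(ops / (fetch_time + compute_time) / 1e6)  # ~296 MFLOPS (~303 with 216 us)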
Performance Metrics – Parallel Systems
Amdahl's Law & Speedup Factor
Amdahl's Law
Amdahl's Law states that the potential program
speedup is bounded by the fraction of code (P)
that can be parallelized:

Max. speedup = 1 / (1 - P)

where f = 1 - P is the serial fraction.
E.g., a 5% serial fraction gives 1/0.05 = 20 (maximum speedup)
Maximum Speedup (Amdahl's Law)
The maximum speedup with p processors is usually p
(linear speedup).
E.g., if f = 1 (all the code is serial), then the speedup will be 1
no matter how many processors are used.
Speedup (with N CPUs or Machines)
• Introducing the number of processors N performing the
parallel fraction of the work, the relationship can be
modelled by:

speedup = 1 / (f_S + f_P / N)

where f_S is the serial fraction and f_P = 1 - f_S is the
parallel fraction.
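A minimal numeric sketch of this model (function name is my own):

def amdahl_speedup(f_serial, n_procs):
    # speedup = 1 / (f_S + f_P / N), with f_P = 1 - f_S
    return 1.0 / (f_serial + (1.0 - f_serial) / n_procs)

for n in (4, 16, 64, 1024):
    print(n, round(amdahl_speedup(0.05, n), 2))
# Approaches the 1/0.05 = 20x ceiling: 3.48, 9.14, 15.42, 19.63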
• Superlinear Speedup
– Speedup of >N, for N processors
• Theoretically not possible
• How is this achievable on real machines?
– Think about the physical resources (cache, memory,
etc.) of N processors: the aggregate cache of N processors
may hold the entire working set
Super-linear Speedup
Super-linear Speedup Example - Searching
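The worked example itself does not survive in this extract; the following is a minimal sketch of the standard searching argument: split the search space across N processors, and if the solution lies near the start of one partition, that processor finds it almost immediately, whereas a sequential scan must first examine every preceding element. All values are assumed for illustration.

# Partitioned search (illustrative values, not from the slide).
space = 1_000_000                  # size of the search space
pos = 750_005                      # index of the single solution
n = 4                              # processors, each scanning one chunk
chunk = space // n

t_seq = pos                        # elements a sequential scan examines
t_par = pos % chunk                # elements the finding processor examines
print(t_seq / t_par)               # 150001.0 -> far greater than N = 4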
Efficiency
• Efficiency is the ability to avoid wasting materials,
energy, effort, money, and time in doing something or in
producing a desired result; in parallel computing, it measures
how effectively the processors are utilized
[Figure: speedup (S) and efficiency (E) as functions of problem size]
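Efficiency for a parallel system is conventionally defined as E = S / N; the formula itself is not legible in this extract, so the sketch below states the standard form and combines it with the Amdahl model above:

def amdahl_speedup(f_serial, n_procs):
    return 1.0 / (f_serial + (1.0 - f_serial) / n_procs)

def efficiency(speedup, n_procs):
    # E = S / N: fraction of ideal linear speedup actually achieved
    return speedup / n_procs

s = amdahl_speedup(0.05, 16)
print(round(s, 2), round(efficiency(s, 16), 2))   # 9.14 0.57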
Gustafson’s Law
Is Amdahl’s Law Sufficient?
• Amdahl’s law works on a fixed problem size
– Shows how execution time decreases as number of
processors increases
– Limits maximum speedup achievable
– So, does it mean large parallel machines are not
useful?
– Ignores performance overheads (e.g., communication,
load imbalance)
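Gustafson's answer is to scale the problem with the machine: hold the parallel work per processor fixed instead of the total problem size. The scaled-speedup formula does not survive in this extract, so the sketch below states its standard form, S = N - f(N - 1) with serial fraction f:

def gustafson_speedup(f_serial, n_procs):
    # scaled speedup S = N - f * (N - 1) for a problem grown with N
    return n_procs - f_serial * (n_procs - 1)

for n in (4, 16, 64, 1024):
    print(n, gustafson_speedup(0.05, n))
# Grows almost linearly: 3.85, 15.25, 60.85, 972.85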