L14 Introduction To Performance Evaluation

This document provides an overview of key concepts and metrics for evaluating computer performance:
- Execution time and throughput are two common notions of performance that can be in opposition; faster execution time does not always mean higher throughput.
- Performance improvement means reducing execution time, while higher throughput means more jobs completed per unit of time.
- Amdahl's Law describes how the overall speedup from an enhancement is limited by the fraction of time the original system spends on the part that can benefit from the enhancement.
- Key metrics discussed include clock cycles, clock frequency, cycles per instruction (CPI), instructions per clock (IPC), and millions of instructions per second (MIPS).


Course on: “Advanced Computer Architectures”

Performance Evaluation

Prof. Cristina Silvano


Politecnico di Milano
email: [email protected]

Basic concepts and performance metrics
Performance
• Purchasing perspective: given a collection of machines, which has the
  • best performance?
  • least cost?
  • best performance / cost?
• Design perspective: faced with design options, which has the
  • best performance improvement?
  • least cost?
  • best performance / cost?
• Both require
  • a basis for comparison
  • metrics for evaluation
• Our goal is to understand the cost & performance implications of architectural choices

Cristina Silvano, 21/03/2021 3


Two notions of “performance”

Plane              DC to Paris   Speed      Passengers   Throughput (pmph)
Boeing 747         6.5 hours     610 mph    470          286,700
BAC/Sud Concorde   3 hours       1350 mph   132          178,200

Which has higher performance?


• Time to do the task (Execution Time)
– Execution time, response time, latency
• Number of jobs done per day, hour, sec, ns (Performance)
– Throughput, bandwidth
• Response time and throughput often are in opposition



Example

• Speed of Concorde vs. Boeing 747?
  • Concorde is 1350 mph / 610 mph = 2.2 times faster (= 6.5 hours / 3 hours)
• Throughput of Concorde vs. Boeing 747?
  • Concorde is 178,200 pmph / 286,700 pmph = 0.62 “times faster”
  • Boeing is 286,700 pmph / 178,200 pmph = 1.60 “times faster”

• Boeing is 1.6 times (“60%”) faster in terms of throughput
• Concorde is 2.2 times (“120%”) faster in terms of flying time

We will focus primarily on execution time for a single job.

Lots of instructions in a program ⇒ instruction throughput is important!



Definitions
• performance(X) = 1 / execution_time(X)

• “X is n% faster than Y” ⇔ execution_time(Y) / execution_time(X) = 1 + n/100

• Equivalently: performance(X) / performance(Y) = 1 + n/100



Performance Improvement

• Performance: higher is better, so improving performance means increasing it
• Execution time (or response time): lower is better, so improving performance means decreasing it



Example
If machine A executes a program in 10 sec and machine B executes the same
program in 15 sec: is A 50% faster than B, or 33% faster?
Solution:
• The statement “A is n% faster than B” can be expressed as:

  execution_time(B) / execution_time(A) = 1 + n/100

  ⇒ n = (execution_time(B) − execution_time(A)) / execution_time(A) × 100

• (15 − 10) / 10 × 100 = 50 ⇒ A is 50% faster than B.
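The "n% faster" formula above can be sketched in Python (a minimal illustration, not part of the original slides; the helper name is ours):

```python
# Sketch: "X is n% faster than Y" computed from execution times.
def percent_faster(time_x: float, time_y: float) -> float:
    """Return n such that X (taking time_x) is n% faster than Y (taking time_y)."""
    return (time_y - time_x) / time_x * 100

n = percent_faster(10, 15)   # machine A: 10 s, machine B: 15 s
print(f"A is {n:.0f}% faster than B")
```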



Clock cycles
• TCLK = period or clock cycle time = time between two consecutive clock
  pulses (seconds per cycle)
• fCLK = clock frequency = clock cycles per second: fCLK = 1 / TCLK
  (where 1 Hz = 1 / sec)
• Examples:
  • A clock frequency of 500 MHz corresponds to a clock cycle time of
    1 / (500 × 10^6) = 2 × 10^-9 s = 2 ns
  • A clock frequency of 1 GHz corresponds to a clock cycle time of
    1 / 10^9 = 1 × 10^-9 s = 1 ns



Execution time or CPU Time

CPU time = Clock Cycles × TCLK = Clock Cycles / fCLK

• To optimize performance means to reduce the execution time (or CPU time) by:
  • reducing the number of clock cycles per program
  • reducing the clock period TCLK, i.e. increasing the clock frequency fCLK



CPU Time

CPU time = Seconds/Program = (Instructions/Program) × (Cycles/Instruction) × (Seconds/Cycle)

CPU time = IC × CPI × TCLK = IC × CPI / fCLK

• where the Clocks Per Instruction is given by:

  CPI = Clock Cycles / Instruction Count

• The Instructions Per Clock is given by: IPC = 1 / CPI



CPU Time

CPU time = Clock Cycles × TCLK = Clock Cycles / fCLK

Where: Clock Cycles = Σ (i = 1..n) (CPI_i × I_i)

⇒ CPU time = Σ (i = 1..n) (CPI_i × I_i) × TCLK

CPI = Σ (i = 1..n) (CPI_i × F_i)

where F_i = I_i / IC is the “instruction frequency”

⇒ CPU time = IC × CPI × TCLK = IC × Σ (i = 1..n) (CPI_i × F_i) × TCLK



Example

Instruction   Frequency   Clock Cycles
ALU           43%         1
Load          21%         4
Store         12%         4
Branch        12%         2
Jump          12%         2

• Evaluate the CPI and the CPU time to execute a program composed of 100
  instructions mixed as in the table, using a 500 MHz clock frequency:

CPI = 0.43 × 1 + 0.21 × 4 + 0.12 × 4 + 0.12 × 2 + 0.12 × 2 = 2.23

CPU time = IC × CPI × Tclock = 100 × 2.23 × 2 ns = 446 ns
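The weighted-CPI calculation above can be sketched in Python (illustrative only; the dictionary layout is ours, the numbers come from the table):

```python
# Sketch: weighted CPI and CPU time from an instruction mix.
# A 500 MHz clock gives TCLK = 2 ns.
mix = {                      # class: (frequency F_i, cycles CPI_i)
    "ALU":    (0.43, 1),
    "Load":   (0.21, 4),
    "Store":  (0.12, 4),
    "Branch": (0.12, 2),
    "Jump":   (0.12, 2),
}

cpi = sum(f * c for f, c in mix.values())     # CPI = Σ CPI_i × F_i  ≈ 2.23
ic, t_clk_ns = 100, 2                         # 100 instructions, 2 ns cycle
cpu_time_ns = ic * cpi * t_clk_ns             # ≈ 446 ns
print(cpi, cpu_time_ns)
```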



MIPS: Millions of Instructions Per Second

MIPS = Instruction Count / (Execution time × 10^6)

Where: Execution time = IC × CPI / fCLK

⇒ MIPS = fCLK / (CPI × 10^6)
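A quick sketch of the MIPS formula, reusing the 500 MHz clock and CPI = 2.23 from the earlier instruction-mix example (the function name is ours):

```python
# Sketch: MIPS = fCLK / (CPI × 10^6).
def mips(f_clk_hz: float, cpi: float) -> float:
    return f_clk_hz / (cpi * 1e6)

print(mips(500e6, 2.23))   # ≈ 224 MIPS
```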



Amdahl’s Law
How to evaluate the speedup due to an enhancement E:

Speedup(E) = ExTime without E / ExTime with E = Performance with E / Performance without E

Suppose that enhancement E accelerates a fraction F of the task by a factor S,
and the remainder of the task is unaffected. Then:

ExTime(with E) = ((1 − F) + F/S) × ExTime(without E)

Speedup(with E) = 1 / ((1 − F) + F/S)



Amdahl’s Law
• Basic idea: make the most common case fast
• Amdahl’s Law: the performance improvement to be gained from using some
  faster execution mode is limited by the fraction of the time the faster mode
  can be used. Let us assume:
  • FractionE: the fraction of the computation time in the original machine
    that can be converted to take advantage of the enhancement
  • SpeedupE: the improvement gained by the enhanced execution mode
• The overall speedup is given by:

  Speedup_overall = ExTime_old / ExTime_new = 1 / ((1 − FractionE) + FractionE / SpeedupE)



Example
• Consider an enhancement that makes a CPU ten times faster at computation
  than the original one, but the original CPU is busy with computation only
  40% of the time. What is the overall speedup gained by introducing the
  enhancement?
• Solution: apply Amdahl’s Law with:
  • FractionE = 0.4
  • SpeedupE = 10
• The overall speedup is given by:

  Speedup_overall = 1 / ((1 − FractionE) + FractionE / SpeedupE) = 1 / (0.6 + 0.04) = 1.56
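Amdahl's Law is one line of Python; the values below reproduce the example above (a sketch, with a function name of our choosing):

```python
# Sketch of Amdahl's Law: overall speedup when a fraction f of the time
# is accelerated by a factor s.
def amdahl_speedup(f: float, s: float) -> float:
    return 1.0 / ((1.0 - f) + f / s)

print(round(amdahl_speedup(0.4, 10), 2))   # 1.56
```

Note that even an infinite SpeedupE would only give 1 / (1 − 0.4) = 1.67 here: the unenhanced 60% bounds the overall gain.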



Basis of Evaluation

• Actual Target Workload
  • Pros: representative
  • Cons: very specific; non-portable; difficult to run or measure; hard to identify cause
• Full Application Benchmarks
  • Pros: portable; widely used; improvements useful in reality
  • Cons: less representative
• Small “Kernel” Benchmarks
  • Pros: easy to run, early in the design cycle
  • Cons: easy to “fool”
• Microbenchmarks
  • Pros: identify peak capability and potential bottlenecks
  • Cons: “peak” may be a long way from application performance


Metrics of performance

Application                  Answers per month
                             Useful operations per second
Programming Language
Compiler
ISA                          (Millions of) instructions per second: MIPS
                             (Millions of) floating-point operations per second: MFLOP/s
Datapath, Control            Megabytes per second
Function Units
Transistors, Wires, Pins     Cycles per second (clock rate)

Each metric has a place and a purpose, and each can be misused.



Aspects of CPU Performance

CPU time = IC × CPI × TCLK

               instr count   CPI   clock rate
Program             X
Compiler            X         X
Instr. Set          X         X        X
Organization                  X        X
Technology                             X



Performance evaluation in pipelined processors
Performance Issues in Pipelining
• Pipelining increases the CPU instruction throughput (number of
instructions completed per unit of time), but it does not reduce
the execution time (latency) of a single instruction.
• Pipelining usually slightly increases the latency of each
instruction due to imbalance among the pipeline stages and
overhead in the control of the pipeline.
• Imbalance among pipeline stages reduces performance since
the clock can run no faster than the time needed for the
slowest pipe stage.
• Pipeline overhead arises from pipeline register delay and
clock skew.

Performance Metrics
IC = Instruction Count

# Clock Cycles = IC + # Stall Cycles + 4

(the “+ 4” accounts for filling the 5-stage pipeline: the first instruction
needs 4 extra cycles before instructions start completing)

CPI = Clocks Per Instruction = # Clock Cycles / IC = (IC + # Stall Cycles + 4) / IC

MIPS = fclock / (CPI × 10^6)

Example
IC = Instruction Count = 5
# Clock Cycles = IC + # Stall Cycles + 4 = 5 + 3 + 4 = 12
CPI = Clocks Per Instruction = # Clock Cycles / IC = 12 / 5 = 2.4
MIPS = fclock / (CPI × 10^6) = 500 MHz / (2.4 × 10^6) = 208.3

                   C1  C2  C3  C4  C5  C6  C7  C8  C9  C10 C11 C12
sub $2, $1, $3     IF  ID  EX  ME  WB
and $12, $2, $5        IF  st  st  st  ID  EX  ME  WB
or  $13, $6, $2                        IF  ID  EX  ME  WB
add $14, $2, $2                            IF  ID  EX  ME  WB
sw  $15, 100($2)                               IF  ID  EX  ME  WB

(st = stall cycle)
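The arithmetic of the example above can be sketched as a small helper (our naming; the 4-cycle fill term assumes the 5-stage pipeline of the slides):

```python
# Sketch: pipeline metrics for IC instructions with a given number of
# stall cycles on a 5-stage pipeline (4 fill cycles).
def pipeline_metrics(ic: int, stalls: int, f_clk_hz: float, fill: int = 4):
    cycles = ic + stalls + fill        # Clock Cycles = IC + Stalls + 4
    cpi = cycles / ic
    mips = f_clk_hz / (cpi * 1e6)
    return cycles, cpi, mips

print(pipeline_metrics(5, 3, 500e6))   # 5 instr, 3 stalls, 500 MHz
```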

Performance Metrics (2)
• Consider n iterations of a loop composed of m instructions per iteration,
  requiring k stalls per iteration:

IC per_iter = m
# Clock Cycles per_iter = IC per_iter + # Stall Cycles per_iter + 4
CPI per_iter = (IC per_iter + # Stall Cycles per_iter + 4) / IC per_iter
             = (m + k + 4) / m
MIPS per_iter = fclock / (CPI per_iter × 10^6)

Asymptotic Performance Metrics
• Consider n iterations of a loop composed of m instructions per iteration,
  requiring k stalls per iteration:

IC_AS = Instruction Count AS = m × n

# Clock Cycles = IC_AS + # Stall Cycles_AS + 4
CPI_AS = lim (n → ∞) (IC_AS + # Stall Cycles_AS + 4) / IC_AS
       = lim (n → ∞) (m × n + k × n + 4) / (m × n)
       = (m + k) / m
MIPS_AS = fclock / (CPI_AS × 10^6)
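The contrast between a single pass and the asymptotic limit can be sketched numerically (hypothetical m = 5, k = 3, matching the earlier 5-instruction example):

```python
# Sketch: per-pass vs asymptotic CPI for a loop of m instructions with
# k stalls per iteration.
m, k = 5, 3
cpi_one_pass   = (m + k + 4) / m     # single pass: the 4 fill cycles count
cpi_asymptotic = (m + k) / m         # n -> infinity: fill cycles amortize away
print(cpi_one_pass, cpi_asymptotic)  # 2.4 vs 1.6
```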

Performance Issues in Pipelining
• The ideal CPI on a pipelined processor would be 1, but stalls cause the
  pipeline performance to degrade from the ideal performance, so we have:

  Ave. CPI Pipe = Ideal CPI + Pipe Stall Cycles per Instruction
                = 1 + Pipe Stall Cycles per Instruction

• Pipe Stall Cycles per Instruction are due to Structural Hazards + Data
  Hazards + Control Hazards + Memory Stalls

Performance Issues in Pipelining

Pipeline Speedup = Ave. Exec. Time Unpipelined / Ave. Exec. Time Pipelined
                 = (Ave. CPI Unp. × Clock Cycle Unp.) / (Ave. CPI Pipe × Clock Cycle Pipe)

Performance Issues in Pipelining
• If we ignore the cycle time overhead of pipelining and assume the stages are
  perfectly balanced, the clock cycle times of the two processors can be
  equal, so:

  Pipeline Speedup = Ave. CPI Unp. / (1 + Pipe Stall Cycles per Instruction)

• Simple case: all instructions take the same number of cycles, which must
  also equal the number of pipeline stages (called the pipeline depth):

  Pipeline Speedup = Pipeline Depth / (1 + Pipe Stall Cycles per Instruction)

• If there are no pipeline stalls (ideal case), this leads to the intuitive
  result that pipelining can improve performance by the depth of the pipeline.

Performance of Branch Schemes
• What is the performance impact of conditional branches?

  Pipeline Speedup = Pipeline Depth / (1 + Pipe Stall Cycles per Instruction due to Branches)
                   = Pipeline Depth / (1 + Branch Frequency × Branch Penalty)
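A numeric sketch of the branch-penalty formula (the 20% branch frequency and 2-cycle penalty below are hypothetical values, not from the slides):

```python
# Sketch: pipeline speedup in the presence of branches.
def branch_speedup(depth: int, branch_freq: float, branch_penalty: float) -> float:
    return depth / (1 + branch_freq * branch_penalty)

# Ideal 5-deep pipeline vs one where 20% of instructions are branches
# costing 2 cycles each:
print(branch_speedup(5, 0.0, 2))    # 5.0 (no branches: full depth)
print(branch_speedup(5, 0.20, 2))   # ≈ 3.57
```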

Performance evaluation of the memory hierarchy

Memory Hierarchy: Definitions
• Hit: the data is found in a block of the upper level
• Hit Rate: number of memory accesses that find the data in the upper level,
  over the total number of memory accesses:

  Hit Rate = # hits / # memory accesses

• Hit Time: time to access the data in the upper level of the hierarchy,
  including the time needed to decide whether the access will result in a hit
  or a miss



Memory Hierarchy: Definitions
• Miss: the data must be taken from the lower level
• Miss Rate: number of memory accesses not finding the data in the upper
  level, over the total number of memory accesses:

  Miss Rate = # misses / # memory accesses

• By definition: Hit Rate + Miss Rate = 1
• Miss Penalty: time needed to access the lower level and to replace the
  block in the upper level
• Miss Time = Hit Time + Miss Penalty
• Typically: Hit Time << Miss Penalty



Cache Memory: Basic Concepts
Average Memory Access Time:

AMAT = Hit Rate × Hit Time + Miss Rate × Miss Time

Being: Miss Time = Hit Time + Miss Penalty

⇒ AMAT = Hit Rate × Hit Time + Miss Rate × (Hit Time + Miss Penalty)

⇒ AMAT = (Hit Rate + Miss Rate) × Hit Time + Miss Rate × Miss Penalty

By definition: Hit Rate + Miss Rate = 1

⇒ AMAT = Hit Time + Miss Rate × Miss Penalty
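The final AMAT form is simple enough to sketch directly (the 1-cycle hit, 5% miss rate, and 50-cycle penalty below are hypothetical illustration values):

```python
# Sketch: AMAT = Hit Time + Miss Rate × Miss Penalty (all in cycles here).
def amat(hit_time: float, miss_rate: float, miss_penalty: float) -> float:
    return hit_time + miss_rate * miss_penalty

print(amat(1, 0.05, 50))   # ≈ 3.5 cycles
```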



Performance evaluation:
Impact of memory hierarchy on CPU time

CPUtime = (CPU exec cycles + Memory stall cycles) × TCLK

where: TCLK = clock cycle time period
       CPU exec cycles = IC × CPIexec
       IC = Instruction Count
       (CPIexec includes ALU and LOAD/STORE instructions)
       Memory stall cycles = IC × Misses per instr × Miss Penalty

⇒ CPUtime = IC × (CPIexec + Misses per instr × Miss Penalty) × TCLK

where: Misses per instr = Memory Accesses Per Instruction × Miss Rate

⇒ CPUtime = IC × (CPIexec + MAPI × Miss Rate × Miss Penalty) × TCLK
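The final CPUtime formula can be sketched as a function (every numeric value below is a hypothetical example of ours, not from the slides):

```python
# Sketch: CPUtime = IC × (CPIexec + MAPI × Miss Rate × Miss Penalty) × TCLK.
def cpu_time_s(ic, cpi_exec, mapi, miss_rate, miss_penalty, t_clk_s):
    return ic * (cpi_exec + mapi * miss_rate * miss_penalty) * t_clk_s

# One million instructions, CPIexec = 1.5, 1.3 memory accesses per
# instruction, 2% miss rate, 50-cycle penalty, 2 ns cycle:
t = cpu_time_s(1e6, 1.5, 1.3, 0.02, 50, 2e-9)
print(t)   # ≈ 5.6e-3 s; memory stalls nearly double the base 3.0e-3 s
```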



Performance evaluation:
Impact of memory hierarchy on CPU time

CPUtime = IC × (CPIexec + MAPI × Miss Rate × Miss Penalty) × TCLK

• For an ideal cache (100% hits):

  CPUtime = IC × CPIexec × TCLK

• For a system without a cache (100% misses):

  CPUtime = IC × (CPIexec + MAPI × Miss Penalty) × TCLK



Performance evaluation:
Impact of memory hierarchy and pipeline stalls on CPU time

CPUtime = IC × (CPIexec + MAPI × Miss Rate × Miss Penalty) × TCLK

Putting it all together, also considering the stalls due to pipeline hazards:

CPUtime = IC × (CPIexec + Stalls per instr + MAPI × Miss Rate × Miss Penalty) × TCLK



Cache Performance

• Average Memory Access Time:

  AMAT = Hit Time + Miss Rate × Miss Penalty

• How to improve cache performance:
  1. Reduce the hit time
  2. Reduce the miss rate
  3. Reduce the miss penalty



Unified Cache vs Separate I$ & D$ (Harvard architecture)

[Diagram: one processor with a unified L1 cache vs. one processor with a
separate I-cache L1 and D-cache L1, to better exploit the locality principle]

• Average Memory Access Time for separate I$ & D$:

  AMAT = % Instr. × (Hit Time + I$ Miss Rate × Miss Penalty) +
         % Data × (Hit Time + D$ Miss Rate × Miss Penalty)

• Usually: I$ Miss Rate << D$ Miss Rate



Unified vs Separate I$ & D$: Example of comparison
• Assumptions:
  • 16KB I$ & 16KB D$: I$ Miss Rate = 0.64%, D$ Miss Rate = 6.47%
  • 32KB unified: aggregate Miss Rate = 1.99%
• Which is better?
  • Assume 33% loads/stores (data ops)
    ⇒ 75% of accesses come from instructions (1.0/1.33)
    ⇒ 25% of accesses come from data (0.33/1.33)
  • Hit Time = 1, Miss Penalty = 50
  • Note: a data hit incurs 1 stall in the unified cache (only one port)

AMAT_Harvard = 75% × (1 + 0.64% × 50) + 25% × (1 + 6.47% × 50) = 2.05
AMAT_Unified = 75% × (1 + 1.99% × 50) + 25% × (1 + 1 + 1.99% × 50) = 2.24
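The comparison above can be sketched in Python (function names are ours; the `data_port_stall` term models the one-port stall on data hits in the unified cache):

```python
# Sketch: AMAT for split (Harvard) vs unified L1 caches.
def amat_split(fi, fd, ht, mr_i, mr_d, mp):
    return fi * (ht + mr_i * mp) + fd * (ht + mr_d * mp)

def amat_unified(fi, fd, ht, mr, mp, data_port_stall=1):
    return fi * (ht + mr * mp) + fd * (ht + data_port_stall + mr * mp)

harvard = amat_split(0.75, 0.25, 1, 0.0064, 0.0647, 50)
unified = amat_unified(0.75, 0.25, 1, 0.0199, 50)
print(harvard, unified)   # ≈ 2.05 vs ≈ 2.24: the split cache wins
```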



Miss Penalty Reduction: Second Level Cache
Basic idea:
• L1 cache small enough to match the fast CPU cycle time
• L2 cache large enough to capture many accesses that would otherwise go to
  main memory, reducing the effective miss penalty

[Diagram: Processor → L1 cache → L2 cache → Main Memory]



AMAT for L1 and L2 Caches

AMAT = Hit Time_L1 + Miss Rate_L1 × Miss Penalty_L1

where: Miss Penalty_L1 = Hit Time_L2 + Miss Rate_L2 × Miss Penalty_L2

⇒ AMAT = Hit Time_L1 + Miss Rate_L1 × (Hit Time_L2 + Miss Rate_L2 × Miss Penalty_L2)

⇒ AMAT = Hit Time_L1 + Miss Rate_L1 × Hit Time_L2 + Miss Rate_L1,L2 × Miss Penalty_L2

(where Miss Rate_L1,L2 = Miss Rate_L1 × Miss Rate_L2)
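A sketch of the two-level expansion (the cycle counts below are hypothetical; the miss rates match the 4%/50% example that follows):

```python
# Sketch: two-level AMAT, with the L1 miss penalty expanded into L2 terms.
def amat_two_level(ht_l1, mr_l1, ht_l2, mr_l2, mp_l2):
    return ht_l1 + mr_l1 * (ht_l2 + mr_l2 * mp_l2)

# 1-cycle L1 hit, 4% local L1 miss rate, 10-cycle L2 hit,
# 50% local L2 miss rate, 100-cycle L2 miss penalty:
print(amat_two_level(1, 0.04, 10, 0.50, 100))   # ≈ 3.4 cycles
```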



Local and global miss rates
• Definitions:
  • Local miss rate: misses in this cache divided by the total number of
    memory accesses to this cache: Miss Rate_L1 for L1 and Miss Rate_L2 for L2
  • Global miss rate: misses in this cache divided by the total number of
    memory accesses generated by the CPU:
    • for L1, the global miss rate is still just Miss Rate_L1
    • for L2, it is Miss Rate_L1 × Miss Rate_L2
• The global miss rate is what really matters: it indicates what fraction of
  memory accesses from the CPU go all the way to main memory



Example
• Consider a computer with an L1 and L2 cache memory hierarchy. Suppose that
  in 1000 memory references there are 40 misses in L1 and 20 misses in L2.
• What are the various miss rates?
  Miss Rate_L1 = 40 / 1000 = 4% (both local and global)
  Miss Rate_L2 (local) = 20 / 40 = 50%
• Global miss rate for the last-level cache (L2):
  Miss Rate_L1,L2 = Miss Rate_L1 × Miss Rate_L2 = (40 / 1000) × (20 / 40) = 2%
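The local/global distinction falls straight out of the counts in the example above:

```python
# Sketch: local vs global miss rates from raw reference counts.
refs, miss_l1, miss_l2 = 1000, 40, 20
mr_l1_local  = miss_l1 / refs        # 4%: L1 local (and global) miss rate
mr_l2_local  = miss_l2 / miss_l1     # 50% of L1 misses also miss in L2
mr_l2_global = miss_l2 / refs        # 2% of CPU accesses reach main memory
print(mr_l1_local, mr_l2_local, mr_l2_global)   # 0.04 0.5 0.02
```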



Memory stalls per instruction for L1 and L2 caches
• Average memory stall cycles per instruction:

  Memory stall cycles per instr = Misses per instr × Miss Penalty

• Average memory stall cycles per instruction for L1 and L2 caches:

  Memory stall cycles per instr =
      Misses_L1 per instr × Hit Time_L2 + Misses_L2 per instr × Miss Penalty_L2



Impact of L1 and L2 on CPU time

CPUtime = IC × (CPIexec + Memory stall cycles per instr) × TCLK

where:
Memory stall cycles per instr =
    Misses_L1 per instr × Hit Time_L2 + Misses_L2 per instr × Miss Penalty_L2
Misses_L1 per instr = Memory Accesses Per Instr × Miss Rate_L1
Misses_L2 per instr = Memory Accesses Per Instr × Miss Rate_L1,L2

⇒ CPUtime = IC × (CPIexec + MAPI × MR_L1 × HT_L2 + MAPI × MR_L1,L2 × MP_L2) × TCLK
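The combined two-level formula can be sketched as follows (CPIexec, MAPI, the L2 timings, and the cycle time are hypothetical values of ours; the 4%/2% miss rates reuse the earlier example):

```python
# Sketch: CPU time with a two-level cache hierarchy.
def cpu_time_two_level(ic, cpi_exec, mapi, mr_l1, mr_l1l2, ht_l2, mp_l2, t_clk_s):
    # stall cycles per instruction: L1 misses pay the L2 hit time,
    # global L2 misses pay the full L2 miss penalty
    stalls = mapi * mr_l1 * ht_l2 + mapi * mr_l1l2 * mp_l2
    return ic * (cpi_exec + stalls) * t_clk_s

# 1e6 instructions, CPIexec = 1.5, MAPI = 1.3, MR_L1 = 4%, MR_L1,L2 = 2%,
# 10-cycle L2 hit, 100-cycle L2 miss penalty, 2 ns cycle:
t = cpu_time_two_level(1e6, 1.5, 1.3, 0.04, 0.02, 10, 100, 2e-9)
print(t)   # ≈ 9.24e-3 s
```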



References
• Chapter 1 of the textbook:
  J. Hennessy, D. Patterson,
  “Computer Architecture: A Quantitative Approach”,
  5th Edition, Morgan Kaufmann Publishers.

