L14 Introduction To Performance Evaluation

This document provides an overview of key concepts and metrics for evaluating computer performance:
- Execution time and throughput are two common notions of performance that can be in opposition; faster execution time does not always mean higher throughput.
- Performance improvement means reducing execution time, while higher throughput means more jobs completed per unit of time.
- Amdahl's Law describes how the overall speedup from an enhancement is limited by the fraction of time the original system spends on the part that can benefit from the enhancement.
- Key metrics discussed include clock cycles, clock frequency, cycles per instruction (CPI), instructions per clock (IPC), and millions of instructions per second (MIPS).


Course on: “Advanced Computer Architectures”

Performance Evaluation

Prof. Cristina Silvano


Politecnico di Milano
email: [email protected]

Basic concepts and performance metrics
Performance
• Purchasing perspective: given a collection of machines, which has the
  • best performance?
  • least cost?
  • best performance / cost?
• Design perspective: faced with design options, which has the
  • best performance improvement?
  • least cost?
  • best performance / cost?
• Both require
  • a basis for comparison
  • metrics for evaluation
• Our goal is to understand the cost & performance implications of architectural choices

Cristina Silvano, 21/03/2021 3


Two notions of “performance”

Plane              DC to Paris   Speed      Passengers   Throughput (pmph)
Boeing 747         6.5 hours     610 mph    470          286,700
BAC/Sud Concorde   3 hours       1350 mph   132          178,200

Which has higher performance?


• Time to do the task (Execution Time)
– Execution time, response time, latency
• Number of jobs done per day, hour, sec, ns (Performance)
– Throughput, bandwidth
• Response time and throughput often are in opposition



Example

• Speed of Concorde vs. Boeing 747?
  • Concorde is 1350 mph / 610 mph = 2.2 times faster (= 6.5 hours / 3 hours)
• Throughput of Concorde vs. Boeing 747?
  • Concorde is 178,200 pmph / 286,700 pmph = 0.62 “times faster”
  • Boeing is 286,700 pmph / 178,200 pmph = 1.60 “times faster”

• Boeing is 1.6 times (“60%”) faster in terms of throughput
• Concorde is 2.2 times (“120%”) faster in terms of flying time

We will focus primarily on execution time for a single job.

Lots of instructions in a program ⇒ instruction throughput is important!



Definitions
• performance(X) = 1 / execution_time(X)

• “X is n% faster than Y” ⇔ execution_time(Y) / execution_time(X) = 1 + n/100

• Equivalently: performance(X) / performance(Y) = 1 + n/100



Performance Improvement

• Performance: higher is better, so improving performance means increasing it
• Execution time (or response time): lower is better, so improving performance means decreasing it



Example
If machine A executes a program in 10 sec and machine B executes the same
program in 15 sec: is A 50% faster than B, or 33% faster?
Solution:
• The statement “A is n% faster than B” can be expressed as:

  execution_time(B) / execution_time(A) = 1 + n/100

  ⇒ n = (execution_time(B) − execution_time(A)) / execution_time(A) × 100

• (15 − 10) / 10 × 100 = 50 ⇒ A is 50% faster than B.
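The "n% faster" formula above can be sketched in Python (a minimal illustration, not part of the original slides; the helper name is ours):

```python
# Sketch: "X is n% faster than Y" computed from execution times.
def percent_faster(time_x: float, time_y: float) -> float:
    """Return n such that X (taking time_x) is n% faster than Y (taking time_y)."""
    return (time_y - time_x) / time_x * 100

n = percent_faster(10, 15)   # machine A: 10 s, machine B: 15 s
print(f"A is {n:.0f}% faster than B")
```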



Clock cycles
• TCLK = period or clock cycle time = time between two consecutive clock
  pulses (seconds per cycle)
• fCLK = clock frequency = clock cycles per second: fCLK = 1 / TCLK
  (where 1 Hz = 1 / sec)
• Examples:
  • A clock frequency of 500 MHz corresponds to a clock cycle time of
    1 / (500 × 10^6) = 2 × 10^-9 s = 2 ns
  • A clock frequency of 1 GHz corresponds to a clock cycle time of
    1 / 10^9 = 1 × 10^-9 s = 1 ns



Execution time or CPU Time

CPU time = Clock Cycles × TCLK = Clock Cycles / fCLK

• To optimize performance means to reduce the execution time (or CPU time) by:
  • reducing the number of clock cycles per program
  • reducing the clock period TCLK, i.e. increasing the clock frequency fCLK



CPU Time

CPU time = Seconds/Program = (Instructions/Program) × (Cycles/Instruction) × (Seconds/Cycle)

CPU time = IC × CPI × TCLK = IC × CPI / fCLK

• where the Clocks Per Instruction is given by:

  CPI = Clock Cycles / Instruction Count

• The Instructions Per Clock is given by: IPC = 1 / CPI



CPU Time

CPU time = Clock Cycles × TCLK = Clock Cycles / fCLK

Where: Clock Cycles = Σ (i = 1..n) (CPI_i × I_i)

⇒ CPU time = Σ (i = 1..n) (CPI_i × I_i) × TCLK

CPI = Σ (i = 1..n) (CPI_i × F_i)

where F_i = I_i / IC is the “instruction frequency”

⇒ CPU time = IC × CPI × TCLK = IC × Σ (i = 1..n) (CPI_i × F_i) × TCLK



Example

Instruction   Frequency   Clock Cycles
ALU           43%         1
Load          21%         4
Store         12%         4
Branch        12%         2
Jump          12%         2

• Evaluate the CPI and the CPU time to execute a program composed of 100
  instructions mixed as in the table, using a 500 MHz clock frequency:

CPI = 0.43 × 1 + 0.21 × 4 + 0.12 × 4 + 0.12 × 2 + 0.12 × 2 = 2.23

CPU time = IC × CPI × Tclock = 100 × 2.23 × 2 ns = 446 ns
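The weighted-CPI calculation above can be sketched in Python (illustrative only; the dictionary layout is ours, the numbers come from the table):

```python
# Sketch: weighted CPI and CPU time from an instruction mix.
# A 500 MHz clock gives TCLK = 2 ns.
mix = {                      # class: (frequency F_i, cycles CPI_i)
    "ALU":    (0.43, 1),
    "Load":   (0.21, 4),
    "Store":  (0.12, 4),
    "Branch": (0.12, 2),
    "Jump":   (0.12, 2),
}

cpi = sum(f * c for f, c in mix.values())     # CPI = Σ CPI_i × F_i  ≈ 2.23
ic, t_clk_ns = 100, 2                         # 100 instructions, 2 ns cycle
cpu_time_ns = ic * cpi * t_clk_ns             # ≈ 446 ns
print(cpi, cpu_time_ns)
```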



MIPS: Millions of Instructions Per Second

MIPS = Instruction Count / (Execution time × 10^6)

Where: Execution time = IC × CPI / fCLK

⇒ MIPS = fCLK / (CPI × 10^6)
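A quick sketch of the MIPS formula, reusing the 500 MHz clock and CPI = 2.23 from the earlier instruction-mix example (the function name is ours):

```python
# Sketch: MIPS = fCLK / (CPI × 10^6).
def mips(f_clk_hz: float, cpi: float) -> float:
    return f_clk_hz / (cpi * 1e6)

print(mips(500e6, 2.23))   # ≈ 224 MIPS
```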



Amdahl’s Law
How to evaluate the speedup due to an enhancement E:

Speedup(E) = ExTime without E / ExTime with E = Performance with E / Performance without E

Suppose that enhancement E accelerates a fraction F of the task by a factor S,
and the remainder of the task is unaffected. Then:

ExTime(with E) = ((1 − F) + F/S) × ExTime(without E)

Speedup(with E) = 1 / ((1 − F) + F/S)



Amdahl’s Law
• Basic idea: make the most common case fast
• Amdahl’s Law: the performance improvement to be gained from using some
  faster execution mode is limited by the fraction of the time the faster mode
  can be used. Let us assume:
  • FractionE: the fraction of the computation time in the original machine
    that can be converted to take advantage of the enhancement
  • SpeedupE: the improvement gained by the enhanced execution mode
• The overall speedup is given by:

  Speedup_overall = ExTime_old / ExTime_new = 1 / ((1 − FractionE) + FractionE / SpeedupE)



Example
• Consider an enhancement that makes a CPU ten times faster at computation
  than the original one, but the original CPU is busy with computation only
  40% of the time. What is the overall speedup gained by introducing the
  enhancement?
• Solution: apply Amdahl’s Law with:
  • FractionE = 0.4
  • SpeedupE = 10
• The overall speedup is given by:

  Speedup_overall = 1 / ((1 − FractionE) + FractionE / SpeedupE) = 1 / (0.6 + 0.04) = 1.56
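Amdahl's Law is one line of Python; the values below reproduce the example above (a sketch, with a function name of our choosing):

```python
# Sketch of Amdahl's Law: overall speedup when a fraction f of the time
# is accelerated by a factor s.
def amdahl_speedup(f: float, s: float) -> float:
    return 1.0 / ((1.0 - f) + f / s)

print(round(amdahl_speedup(0.4, 10), 2))   # 1.56
```

Note that even an infinite SpeedupE would only give 1 / (1 − 0.4) = 1.67 here: the unenhanced 60% bounds the overall gain.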



Basis of Evaluation

• Actual Target Workload
  • Pros: representative
  • Cons: very specific; non-portable; difficult to run or measure; hard to identify cause
• Full Application Benchmarks
  • Pros: portable; widely used; improvements useful in reality
  • Cons: less representative
• Small “Kernel” Benchmarks
  • Pros: easy to run, early in the design cycle
  • Cons: easy to “fool”
• Microbenchmarks
  • Pros: identify peak capability and potential bottlenecks
  • Cons: “peak” may be a long way from application performance


Metrics of performance

Application                  Answers per month
                             Useful operations per second
Programming Language
Compiler
ISA                          (Millions of) instructions per second: MIPS
                             (Millions of) floating-point operations per second: MFLOP/s
Datapath, Control            Megabytes per second
Function Units
Transistors, Wires, Pins     Cycles per second (clock rate)

Each metric has a place and a purpose, and each can be misused.



Aspects of CPU Performance

CPU time = IC × CPI × TCLK

               instr count   CPI   clock rate
Program             X
Compiler            X         X
Instr. Set          X         X        X
Organization                  X        X
Technology                             X



Performance evaluation in pipelined processors
Performance Issues in Pipelining
• Pipelining increases the CPU instruction throughput (number of
instructions completed per unit of time), but it does not reduce
the execution time (latency) of a single instruction.
• Pipelining usually slightly increases the latency of each
instruction due to imbalance among the pipeline stages and
overhead in the control of the pipeline.
• Imbalance among pipeline stages reduces performance since
the clock can run no faster than the time needed for the
slowest pipe stage.
• Pipeline overhead arises from pipeline register delay and
clock skew.

Performance Metrics
IC = Instruction Count

# Clock Cycles = IC + # Stall Cycles + 4

(the “+ 4” accounts for filling the 5-stage pipeline: the first instruction
needs 4 extra cycles before instructions start completing)

CPI = Clocks Per Instruction = # Clock Cycles / IC = (IC + # Stall Cycles + 4) / IC

MIPS = fclock / (CPI × 10^6)

Example
IC = Instruction Count = 5
# Clock Cycles = IC + # Stall Cycles + 4 = 5 + 3 + 4 = 12
CPI = Clocks Per Instruction = # Clock Cycles / IC = 12 / 5 = 2.4
MIPS = fclock / (CPI × 10^6) = 500 MHz / (2.4 × 10^6) = 208.3

                   C1  C2  C3  C4  C5  C6  C7  C8  C9  C10 C11 C12
sub $2, $1, $3     IF  ID  EX  ME  WB
and $12, $2, $5        IF  st  st  st  ID  EX  ME  WB
or  $13, $6, $2                        IF  ID  EX  ME  WB
add $14, $2, $2                            IF  ID  EX  ME  WB
sw  $15, 100($2)                               IF  ID  EX  ME  WB

(st = stall cycle)
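The arithmetic of the example above can be sketched as a small helper (our naming; the 4-cycle fill term assumes the 5-stage pipeline of the slides):

```python
# Sketch: pipeline metrics for IC instructions with a given number of
# stall cycles on a 5-stage pipeline (4 fill cycles).
def pipeline_metrics(ic: int, stalls: int, f_clk_hz: float, fill: int = 4):
    cycles = ic + stalls + fill        # Clock Cycles = IC + Stalls + 4
    cpi = cycles / ic
    mips = f_clk_hz / (cpi * 1e6)
    return cycles, cpi, mips

print(pipeline_metrics(5, 3, 500e6))   # 5 instr, 3 stalls, 500 MHz
```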

Performance Metrics (2)
• Consider n iterations of a loop composed of m instructions per iteration,
  requiring k stalls per iteration:

IC per_iter = m
# Clock Cycles per_iter = IC per_iter + # Stall Cycles per_iter + 4
CPI per_iter = (IC per_iter + # Stall Cycles per_iter + 4) / IC per_iter
             = (m + k + 4) / m
MIPS per_iter = fclock / (CPI per_iter × 10^6)

Asymptotic Performance Metrics
• Consider n iterations of a loop composed of m instructions per iteration,
  requiring k stalls per iteration:

IC_AS = Instruction Count AS = m × n

# Clock Cycles = IC_AS + # Stall Cycles_AS + 4
CPI_AS = lim (n → ∞) (IC_AS + # Stall Cycles_AS + 4) / IC_AS
       = lim (n → ∞) (m × n + k × n + 4) / (m × n)
       = (m + k) / m
MIPS_AS = fclock / (CPI_AS × 10^6)
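The contrast between a single pass and the asymptotic limit can be sketched numerically (hypothetical m = 5, k = 3, matching the earlier 5-instruction example):

```python
# Sketch: per-pass vs asymptotic CPI for a loop of m instructions with
# k stalls per iteration.
m, k = 5, 3
cpi_one_pass   = (m + k + 4) / m     # single pass: the 4 fill cycles count
cpi_asymptotic = (m + k) / m         # n -> infinity: fill cycles amortize away
print(cpi_one_pass, cpi_asymptotic)  # 2.4 vs 1.6
```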

Performance Issues in Pipelining
• The ideal CPI on a pipelined processor would be 1, but stalls cause the
  pipeline performance to degrade from the ideal performance, so we have:

  Ave. CPI Pipe = Ideal CPI + Pipe Stall Cycles per Instruction
                = 1 + Pipe Stall Cycles per Instruction

• Pipe Stall Cycles per Instruction are due to Structural Hazards + Data
  Hazards + Control Hazards + Memory Stalls

Performance Issues in Pipelining

Pipeline Speedup = Ave. Exec. Time Unpipelined / Ave. Exec. Time Pipelined
                 = (Ave. CPI Unp. × Clock Cycle Unp.) / (Ave. CPI Pipe × Clock Cycle Pipe)

Performance Issues in Pipelining
• If we ignore the cycle time overhead of pipelining and assume the stages are
  perfectly balanced, the clock cycle times of the two processors can be
  equal, so:

  Pipeline Speedup = Ave. CPI Unp. / (1 + Pipe Stall Cycles per Instruction)

• Simple case: all instructions take the same number of cycles, which must
  also equal the number of pipeline stages (called the pipeline depth):

  Pipeline Speedup = Pipeline Depth / (1 + Pipe Stall Cycles per Instruction)

• If there are no pipeline stalls (ideal case), this leads to the intuitive
  result that pipelining can improve performance by the depth of the pipeline.

Performance of Branch Schemes
• What is the performance impact of conditional branches?

  Pipeline Speedup = Pipeline Depth / (1 + Pipe Stall Cycles per Instruction due to Branches)
                   = Pipeline Depth / (1 + Branch Frequency × Branch Penalty)
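A numeric sketch of the branch-penalty formula (the 20% branch frequency and 2-cycle penalty below are hypothetical values, not from the slides):

```python
# Sketch: pipeline speedup in the presence of branches.
def branch_speedup(depth: int, branch_freq: float, branch_penalty: float) -> float:
    return depth / (1 + branch_freq * branch_penalty)

# Ideal 5-deep pipeline vs one where 20% of instructions are branches
# costing 2 cycles each:
print(branch_speedup(5, 0.0, 2))    # 5.0 (no branches: full depth)
print(branch_speedup(5, 0.20, 2))   # ≈ 3.57
```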

Performance evaluation of the memory hierarchy

Memory Hierarchy: Definitions
• Hit: the data is found in a block of the upper level
• Hit Rate: number of memory accesses that find the data in the upper level,
  over the total number of memory accesses:

  Hit Rate = # hits / # memory accesses

• Hit Time: time to access the data in the upper level of the hierarchy,
  including the time needed to decide whether the access will result in a hit
  or a miss



Memory Hierarchy: Definitions
• Miss: the data must be taken from the lower level
• Miss Rate: number of memory accesses not finding the data in the upper
  level, over the total number of memory accesses:

  Miss Rate = # misses / # memory accesses

• By definition: Hit Rate + Miss Rate = 1
• Miss Penalty: time needed to access the lower level and to replace the
  block in the upper level
• Miss Time = Hit Time + Miss Penalty
• Typically: Hit Time << Miss Penalty



Cache Memory: Basic Concepts
Average Memory Access Time:

AMAT = Hit Rate × Hit Time + Miss Rate × Miss Time

Being: Miss Time = Hit Time + Miss Penalty

⇒ AMAT = Hit Rate × Hit Time + Miss Rate × (Hit Time + Miss Penalty)

⇒ AMAT = (Hit Rate + Miss Rate) × Hit Time + Miss Rate × Miss Penalty

By definition: Hit Rate + Miss Rate = 1

⇒ AMAT = Hit Time + Miss Rate × Miss Penalty
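The final AMAT form is simple enough to sketch directly (the 1-cycle hit, 5% miss rate, and 50-cycle penalty below are hypothetical illustration values):

```python
# Sketch: AMAT = Hit Time + Miss Rate × Miss Penalty (all in cycles here).
def amat(hit_time: float, miss_rate: float, miss_penalty: float) -> float:
    return hit_time + miss_rate * miss_penalty

print(amat(1, 0.05, 50))   # ≈ 3.5 cycles
```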



Performance evaluation:
Impact of memory hierarchy on CPU time

CPUtime = (CPU exec cycles + Memory stall cycles) × TCLK

where: TCLK = clock cycle time period
       CPU exec cycles = IC × CPIexec
       IC = Instruction Count
       (CPIexec includes ALU and LOAD/STORE instructions)
       Memory stall cycles = IC × Misses per instr × Miss Penalty

⇒ CPUtime = IC × (CPIexec + Misses per instr × Miss Penalty) × TCLK

where: Misses per instr = Memory Accesses Per Instruction × Miss Rate

⇒ CPUtime = IC × (CPIexec + MAPI × Miss Rate × Miss Penalty) × TCLK
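The final CPUtime formula can be sketched as a function (every numeric value below is a hypothetical example of ours, not from the slides):

```python
# Sketch: CPUtime = IC × (CPIexec + MAPI × Miss Rate × Miss Penalty) × TCLK.
def cpu_time_s(ic, cpi_exec, mapi, miss_rate, miss_penalty, t_clk_s):
    return ic * (cpi_exec + mapi * miss_rate * miss_penalty) * t_clk_s

# One million instructions, CPIexec = 1.5, 1.3 memory accesses per
# instruction, 2% miss rate, 50-cycle penalty, 2 ns cycle:
t = cpu_time_s(1e6, 1.5, 1.3, 0.02, 50, 2e-9)
print(t)   # ≈ 5.6e-3 s; memory stalls nearly double the base 3.0e-3 s
```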



Performance evaluation:
Impact of memory hierarchy on CPU time

CPUtime = IC × (CPIexec + MAPI × Miss Rate × Miss Penalty) × TCLK

• For an ideal cache (100% hits):

  CPUtime = IC × CPIexec × TCLK

• For a system without a cache (100% misses):

  CPUtime = IC × (CPIexec + MAPI × Miss Penalty) × TCLK



Performance evaluation:
Impact of memory hierarchy and pipeline stalls on CPU time

CPUtime = IC × (CPIexec + MAPI × Miss Rate × Miss Penalty) × TCLK

Putting it all together, also considering the stalls due to pipeline hazards:

CPUtime = IC × (CPIexec + Stalls per instr + MAPI × Miss Rate × Miss Penalty) × TCLK



Cache Performance

• Average Memory Access Time:

  AMAT = Hit Time + Miss Rate × Miss Penalty

• How to improve cache performance:
  1. Reduce the hit time
  2. Reduce the miss rate
  3. Reduce the miss penalty



Unified Cache vs Separate I$ & D$ (Harvard architecture)

[Diagram: one processor with a unified L1 cache vs. one processor with a
separate I-cache L1 and D-cache L1, to better exploit the locality principle]

• Average Memory Access Time for separate I$ & D$:

  AMAT = % Instr. × (Hit Time + I$ Miss Rate × Miss Penalty) +
         % Data × (Hit Time + D$ Miss Rate × Miss Penalty)

• Usually: I$ Miss Rate << D$ Miss Rate



Unified vs Separate I$ & D$: Example of comparison
• Assumptions:
  • 16KB I$ & 16KB D$: I$ Miss Rate = 0.64%, D$ Miss Rate = 6.47%
  • 32KB unified: aggregate Miss Rate = 1.99%
• Which is better?
  • Assume 33% loads/stores (data ops)
    ⇒ 75% of accesses come from instructions (1.0/1.33)
    ⇒ 25% of accesses come from data (0.33/1.33)
  • Hit Time = 1, Miss Penalty = 50
  • Note: a data hit incurs 1 stall in the unified cache (only one port)

AMAT_Harvard = 75% × (1 + 0.64% × 50) + 25% × (1 + 6.47% × 50) = 2.05
AMAT_Unified = 75% × (1 + 1.99% × 50) + 25% × (1 + 1 + 1.99% × 50) = 2.24
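The comparison above can be sketched in Python (function names are ours; the `data_port_stall` term models the one-port stall on data hits in the unified cache):

```python
# Sketch: AMAT for split (Harvard) vs unified L1 caches.
def amat_split(fi, fd, ht, mr_i, mr_d, mp):
    return fi * (ht + mr_i * mp) + fd * (ht + mr_d * mp)

def amat_unified(fi, fd, ht, mr, mp, data_port_stall=1):
    return fi * (ht + mr * mp) + fd * (ht + data_port_stall + mr * mp)

harvard = amat_split(0.75, 0.25, 1, 0.0064, 0.0647, 50)
unified = amat_unified(0.75, 0.25, 1, 0.0199, 50)
print(harvard, unified)   # ≈ 2.05 vs ≈ 2.24: the split cache wins
```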



Miss Penalty Reduction: Second Level Cache
Basic idea:
• L1 cache small enough to match the fast CPU cycle time
• L2 cache large enough to capture many accesses that would otherwise go to
  main memory, reducing the effective miss penalty

[Diagram: Processor → L1 cache → L2 cache → Main Memory]



AMAT for L1 and L2 Caches

AMAT = Hit Time_L1 + Miss Rate_L1 × Miss Penalty_L1

where: Miss Penalty_L1 = Hit Time_L2 + Miss Rate_L2 × Miss Penalty_L2

⇒ AMAT = Hit Time_L1 + Miss Rate_L1 × (Hit Time_L2 + Miss Rate_L2 × Miss Penalty_L2)

⇒ AMAT = Hit Time_L1 + Miss Rate_L1 × Hit Time_L2 + Miss Rate_L1,L2 × Miss Penalty_L2

(where Miss Rate_L1,L2 = Miss Rate_L1 × Miss Rate_L2)
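A sketch of the two-level expansion (the cycle counts below are hypothetical; the miss rates match the 4%/50% example that follows):

```python
# Sketch: two-level AMAT, with the L1 miss penalty expanded into L2 terms.
def amat_two_level(ht_l1, mr_l1, ht_l2, mr_l2, mp_l2):
    return ht_l1 + mr_l1 * (ht_l2 + mr_l2 * mp_l2)

# 1-cycle L1 hit, 4% local L1 miss rate, 10-cycle L2 hit,
# 50% local L2 miss rate, 100-cycle L2 miss penalty:
print(amat_two_level(1, 0.04, 10, 0.50, 100))   # ≈ 3.4 cycles
```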



Local and global miss rates
• Definitions:
  • Local miss rate: misses in this cache divided by the total number of
    memory accesses to this cache: Miss Rate_L1 for L1 and Miss Rate_L2 for L2
  • Global miss rate: misses in this cache divided by the total number of
    memory accesses generated by the CPU:
    • for L1, the global miss rate is still just Miss Rate_L1
    • for L2, it is Miss Rate_L1 × Miss Rate_L2
• The global miss rate is what really matters: it indicates what fraction of
  memory accesses from the CPU go all the way to main memory



Example
• Consider a computer with an L1 and L2 cache memory hierarchy. Suppose that
  in 1000 memory references there are 40 misses in L1 and 20 misses in L2.
• What are the various miss rates?
  Miss Rate_L1 = 40 / 1000 = 4% (both local and global)
  Miss Rate_L2 (local) = 20 / 40 = 50%
• Global miss rate for the last-level cache (L2):
  Miss Rate_L1,L2 = Miss Rate_L1 × Miss Rate_L2 = (40 / 1000) × (20 / 40) = 2%
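The local/global distinction falls straight out of the counts in the example above:

```python
# Sketch: local vs global miss rates from raw reference counts.
refs, miss_l1, miss_l2 = 1000, 40, 20
mr_l1_local  = miss_l1 / refs        # 4%: L1 local (and global) miss rate
mr_l2_local  = miss_l2 / miss_l1     # 50% of L1 misses also miss in L2
mr_l2_global = miss_l2 / refs        # 2% of CPU accesses reach main memory
print(mr_l1_local, mr_l2_local, mr_l2_global)   # 0.04 0.5 0.02
```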



Memory stalls per instruction for L1 and L2 caches
• Average memory stall cycles per instruction:

  Memory stall cycles per instr = Misses per instr × Miss Penalty

• Average memory stall cycles per instruction for L1 and L2 caches:

  Memory stall cycles per instr =
      Misses_L1 per instr × Hit Time_L2 + Misses_L2 per instr × Miss Penalty_L2



Impact of L1 and L2 on CPU time

CPUtime = IC × (CPIexec + Memory stall cycles per instr) × TCLK

where:
Memory stall cycles per instr =
    Misses_L1 per instr × Hit Time_L2 + Misses_L2 per instr × Miss Penalty_L2
Misses_L1 per instr = Memory Accesses Per Instr × Miss Rate_L1
Misses_L2 per instr = Memory Accesses Per Instr × Miss Rate_L1,L2

⇒ CPUtime = IC × (CPIexec + MAPI × MR_L1 × HT_L2 + MAPI × MR_L1,L2 × MP_L2) × TCLK
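The combined two-level formula can be sketched as follows (CPIexec, MAPI, the L2 timings, and the cycle time are hypothetical values of ours; the 4%/2% miss rates reuse the earlier example):

```python
# Sketch: CPU time with a two-level cache hierarchy.
def cpu_time_two_level(ic, cpi_exec, mapi, mr_l1, mr_l1l2, ht_l2, mp_l2, t_clk_s):
    # stall cycles per instruction: L1 misses pay the L2 hit time,
    # global L2 misses pay the full L2 miss penalty
    stalls = mapi * mr_l1 * ht_l2 + mapi * mr_l1l2 * mp_l2
    return ic * (cpi_exec + stalls) * t_clk_s

# 1e6 instructions, CPIexec = 1.5, MAPI = 1.3, MR_L1 = 4%, MR_L1,L2 = 2%,
# 10-cycle L2 hit, 100-cycle L2 miss penalty, 2 ns cycle:
t = cpu_time_two_level(1e6, 1.5, 1.3, 0.04, 0.02, 10, 100, 2e-9)
print(t)   # ≈ 9.24e-3 s
```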



References
• Chapter 1 of the textbook:
  J. Hennessy, D. Patterson,
  “Computer Architecture: A Quantitative Approach”,
  5th Edition, Morgan Kaufmann Publishers.

