4 Performance

The document discusses measuring and reporting computer system performance. It covers topics like CPU components, memory, caches, benchmarks, and performance metrics. Effective performance measurement requires considering multiple dimensions like execution time by instruction type, cache bandwidth, and I/O performance.

Uploaded by

Laith Qasem

Measuring & Reporting Performance

Aziz Qaroush

Review: Computer System Components
• CPU core: 1 GHz - 3.8 GHz, 4-way superscalar
– RISC or RISC-core (x86): deep instruction pipelines, dynamic scheduling, multiple FP and integer FUs, dynamic branch prediction, hardware speculation
• Caches (all non-blocking)
– L1: 16-128K, 1-2 way set associative (on chip), separate or unified
– L2: 256K-2M, 4-32 way set associative (on chip), unified
– L3: 2-16M, 8-32 way set associative (off or on chip), unified
• Front Side Bus (FSB) examples
– Alpha, AMD K7: EV6, 200-400 MHz
– Intel PII, PIII: GTL+, 133 MHz
– Intel P4: 800 MHz
• Memory controller: off or on chip
• Memory (on the memory bus)
– SDRAM PC100/PC133: 100-133 MHz, 64-128 bits wide, 2-way interleaved, ~900 MBytes/sec (64-bit)
– Current standard, Double Data Rate (DDR) SDRAM PC3200: 200 MHz DDR, 64-128 bits wide, 4-way interleaved, ~3.2 GBytes/sec (one 64-bit channel), ~6.4 GBytes/sec (two 64-bit channels)
– RAMbus DRAM (RDRAM): 400 MHz DDR, 16 bits wide (32 banks), ~1.6 GBytes/sec
• I/O buses
– Example: PCI, 33-66 MHz, 32-64 bits wide, 133-528 MBytes/sec
– PCI-X: 133 MHz, 64 bit, 1024 MBytes/sec
• I/O subsystem: controllers, adapters, and NICs connecting the I/O devices (disks, displays, keyboards, networks)
• Chipset: North Bridge and South Bridge
Architecture continually changing
• Applications suggest how to improve technology and provide revenue to fund development
• Improved technologies make new applications possible
• Cost of software development makes compatibility a major force in the market
Review: What is Computer Architecture?
[Figure: the computer architect works between applications and technology, defining the interfaces (API, ISA, link, I/O channel) and the machine organization (registers, IR), guided by measurement and evaluation.]
The Architecture Process
[Figure: new concepts are created, their cost and performance are estimated, and the results are sorted into good, mediocre, and bad ideas.]
What is Performance?
• How can we make intelligent choices about computers?
• Why does some computer hardware perform better on some programs but worse on others?
• How do we measure the performance of a computer?
• What factors are hardware related? Software related?
• How does a machine's instruction set affect performance?
• Understanding performance is key to understanding the underlying organizational motivation
Measuring performance
• We need measures
– Comparison of machine properties
– Comparison of software properties (compilers)
• Purpose
– Making purchase decisions
– Development of new architectures
• Is a single measure sufficient?
– Is a machine with a 600 MHz clock necessarily faster than one with a 500 MHz clock?
– Why do we still have mainframes?

Performance Measurement and Evaluation
• Many dimensions to computer performance
– CPU execution time
  • by instruction or sequence
    – floating point
    – integer
    – branch performance
– Cache bandwidth
– Main memory bandwidth
– I/O performance
  • bandwidth
  • seeks
  • pixels or polygons per second
• Relative importance depends on applications
Evaluation Tools
• Benchmarks, traces, & mixes
– macrobenchmarks & suites
  • application execution time
– microbenchmarks
  • measure one aspect of performance
– traces
  • replay recorded accesses
    – cache, branch, register
– mixes, e.g. an instruction mix: MOVE 39%, BR 20%, LOAD 20%, STORE 10%, ALU 11%
• Simulation at many levels
– ISA, cycle accurate, RTL, gate, circuit
  • trade fidelity for simulation rate
• Area and delay estimation
• Analysis
– e.g., queuing theory
– Fundamental laws
Metrics of Computer Performance
Each level of the system has its own metrics:
• Application: answers per month, operations per second
• Programming language, compiler, ISA: (millions of) instructions per second (MIPS); (millions of) FP operations per second (MFLOP/s)
• Datapath, control: megabytes per second
• Function units, transistors, wires, pins: cycles per second (clock rate)

Each metric has a purpose, and each can be misused.
Benchmarks and Benchmarking

Some definitions are:

• A test that measures the performance of a system or subsystem on a well-defined task or set of tasks.
• A method of comparing the performance of different computer architectures.
• A method of comparing the performance of different software.
Some Warnings about Benchmarks
• Benchmarks measure the whole system
– application
– compiler
– operating system
– architecture
– implementation
• Popular benchmarks typically reflect yesterday's programs
– computers need to be designed for tomorrow's programs
• Benchmark timings are often very sensitive to
– alignment in cache
– location of data on disk
– values of data
• Benchmarks can lead to inbreeding or positive feedback
– if you make an operation fast (slow) it will be used more (less) often
  • so you make it faster (slower)
    – and it gets used even more (less)
      » and so on…
Choosing Programs To Evaluate Performance
Levels of programs or benchmarks that could be used to evaluate
performance:
– Actual Target Workload: Full applications that run on the target
machine.
– Real Full Program-based Benchmarks:
• Select a specific mix or suite of programs that are typical of targeted
applications or workload (e.g., SPEC95, SPEC CPU2000).
– Small “Kernel” Benchmarks:
• Key computationally-intensive pieces extracted from real programs.
– Examples: Matrix factorization, FFT, tree search, etc.
• Best used to test specific aspects of the machine.
– Microbenchmarks:
• Small, specially written programs that isolate a specific aspect of performance: processing (integer, floating point), local memory, input/output, etc.
Types of Benchmarks
• Actual Target Workload
– Pros: Representative.
– Cons: Very specific. Non-portable. Complex: difficult to run or measure.
• Full Application Benchmarks
– Pros: Portable. Widely used. Measurements useful in reality.
– Cons: Less representative than actual workload.
• Small “Kernel” Benchmarks
– Pros: Easy to run, early in the design cycle.
– Cons: Easy to “fool” by designing hardware to run them well.
• Microbenchmarks
– Pros: Identify peak performance and potential bottlenecks.
– Cons: Peak performance results may be a long way from real application performance.
SPEC: System Performance Evaluation Cooperative
The most popular and industry-standard set of CPU benchmarks.
• SPECmarks, 1989:
– 10 programs yielding a single number (“SPECmarks”).
• SPEC92, 1992:
– SPECint92 (6 integer programs) and SPECfp92 (14 floating-point programs).
• SPEC95, 1995:
– SPECint95 (8 integer programs):
• go, m88ksim, gcc, compress, li, ijpeg, perl, vortex
– SPECfp95 (10 floating-point intensive programs):
• tomcatv, swim, su2cor, hydro2d, mgrid, applu, turb3d, apsi, fpppp, wave5
– Performance relative to a Sun SuperSPARC I (50 MHz), which is given a score of SPECint95 = SPECfp95 = 1
• SPEC CPU2000, 1999:
– CINT2000 (12 integer programs), CFP2000 (14 floating-point intensive programs)
– Performance relative to a Sun Ultra5_10 (300 MHz), which is given a score of SPECint2000 = SPECfp2000 = 100
Application Performance: Intel Core i9-12900K vs Ryzen 9 5950X and Ryzen 9 5900X

Power Consumption, Efficiency, and Cooling: Intel Core i9-12900K vs Ryzen 9 5950X and Ryzen 9 5900X

Architectural Performance Laws and Rules of Thumb
• Measurement and Evaluation
– Architecture is an iterative process:
• Searching the space of possible designs
• Make selections
• Evaluate the selections made
– Good measurement tools are required to accurately evaluate the
selection.
• Measurement Tools
– Benchmarks, Traces, Mixes
– Cost, delay, area, power estimation
– Simulation (many levels)
• ISA, RTL, Gate, Circuit
– Queuing Theory
– Rules of Thumb
– Fundamental Laws

Time as a Measure of Performance
• Response Time
– Time between start and completion of a task

– As observed and measured by the end user

– Also called Wall-Clock Time or Elapsed Time

– Response Time = CPU Time + Waiting Time (I/O, scheduling, etc.)

• CPU Execution Time

– Time spent executing the program instructions

– CPU time = User CPU time + Kernel CPU time

– Can be measured in seconds, msec, µsec, etc.

– Can be related to the number of CPU clock cycles

• Our focus: user CPU time

– Time spent executing the lines of code that are "in" our program
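
The difference between response time and CPU time can be observed directly. The following is a minimal sketch, assuming Python's standard `time` module; the helper names `measure`, `busy`, and `idle` are hypothetical and not part of the slides. `time.perf_counter()` approximates wall-clock (elapsed) time, while `time.process_time()` reports user plus kernel CPU time of the current process and excludes time spent sleeping.

```python
import time

def measure(task):
    """Run task() once and return (elapsed_seconds, cpu_seconds)."""
    wall_start = time.perf_counter()   # wall-clock (elapsed) time
    cpu_start = time.process_time()    # user + kernel CPU time of this process
    task()
    cpu = time.process_time() - cpu_start
    elapsed = time.perf_counter() - wall_start
    return elapsed, cpu

def busy():
    """CPU-bound work: most of the elapsed time is CPU time."""
    total = 0
    for i in range(200_000):
        total += i * i
    return total

def idle():
    """Waiting: elapsed time grows, CPU time barely changes."""
    time.sleep(0.05)

for name, task in (("busy", busy), ("idle", idle)):
    elapsed, cpu = measure(task)
    waiting = max(elapsed - cpu, 0.0)  # Response time = CPU time + waiting time
    print(f"{name}: elapsed={elapsed:.4f}s cpu={cpu:.4f}s waiting~{waiting:.4f}s")
```

The exact numbers depend on the machine and on system load, so no expected output is shown; the point is only that the idle task has a large elapsed time and a small CPU time.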
Throughput as a Performance Metric
• Throughput = Total work done per unit of time
– Tasks per hour
– Transactions per minute

• Decreasing the execution time improves throughput

– Example: using a faster version of a processor

– Less time to run a task → more tasks can be executed per unit of time

• Parallel hardware improves throughput and response time

– By increasing the number of processors in a multiprocessor

– More tasks can be executed in parallel

– Execution time of individual sequential tasks is not changed

– Less waiting time in queues reduces (improves) response time

CPU Performance Evaluation
• Most computers run synchronously utilizing a CPU clock running at a constant clock rate:
Clock rate = 1 / clock cycle, measured in cycles/sec = Hertz = Hz
[Figure: clock waveform showing cycle 1, cycle 2, cycle 3, each lasting one clock cycle]
• The CPU clock rate depends on the specific CPU organization
(design) and hardware implementation technology (VLSI) used
• A computer machine (ISA) instruction is comprised of a number of
elementary or micro operations which vary in number and complexity
depending on the instruction and the exact CPU organization
(Design)
– A micro operation is an elementary hardware operation that can be performed
during one CPU clock cycle.
– This corresponds to one micro-instruction in microprogrammed CPUs.
– Examples: register operations: shift, load, clear, increment, ALU operations: add ,
subtract, etc.
• Thus a single machine instruction may take one or more CPU cycles to complete; this number is termed the Cycles Per Instruction (CPI).
– Average CPI of a program: The average CPI of all instructions executed in the program on a given CPU design.
Generic CPU Machine Instruction Execution Steps

Computer Performance Measures: Program Execution Time
• A specific program compiled to run on a specific machine (CPU) “A” has the following parameters:
– The total executed instruction count of the program: I
– The average number of cycles per instruction (average CPI): CPI
– Clock cycle of machine “A”: C
• How can one measure the performance of this machine (CPU) running this program?
– Intuitively the machine (or CPU) is said to be faster or to have better performance running this program if the total execution time is shorter.
– Thus the inverse of the total measured program execution time is a possible performance measure or metric:
PerformanceA = 1 / Execution TimeA
• How to compare performance of different machines? What factors affect performance? How to improve performance?
Comparing Computer Performance Using Execution Time
• To compare the performance of two machines (or CPUs) “A” and “B” running a given specific program:
PerformanceA = 1 / Execution TimeA
PerformanceB = 1 / Execution TimeB
• Machine A is n times faster than machine B (or slower, if n < 1) means:
Speedup = n = PerformanceA / PerformanceB = Execution TimeB / Execution TimeA
(i.e., speedup is a ratio of performance and has no units)
• Example: For a given program:
Execution time on machine A: Execution TimeA = 1 second
Execution time on machine B: Execution TimeB = 10 seconds
Speedup = PerformanceA / PerformanceB = Execution TimeB / Execution TimeA = 10 / 1 = 10
The performance of machine A is 10 times the performance of machine B when running this program, or: Machine A is said to be 10 times faster than machine B when running this program.
CPU Execution Time: The CPU Equation
• A program is comprised of a number of instructions executed, I
– Measured in: instructions/program
• The average instruction executed takes a number of cycles per instruction (CPI) to be completed
– Measured in: cycles/instruction, CPI
– Or Instructions Per Cycle (IPC): IPC = 1/CPI
• CPU has a fixed clock cycle time C = 1/clock rate
– Measured in: seconds/cycle
• CPU execution time is the product of the above three parameters as follows:
CPU time = Seconds/Program = (Instructions/Program) x (Cycles/Instruction) x (Seconds/Cycle)
T = I x CPI x C
where T is the execution time per program in seconds, I is the number of instructions executed, CPI is the average CPI for the program, and C is the CPU clock cycle.
(This equation is commonly known as the CPU performance equation)
CPU Average CPI/Execution Time
For a given program executed on a given machine (CPU):
CPI (average) = Total program execution cycles / Instruction count
→ CPU clock cycles = Instruction count x CPI
CPU execution time = CPU clock cycles x Clock cycle
= Instruction count x CPI x Clock cycle
T = I x CPI x C
where T is the execution time per program in seconds, I is the number of instructions executed, CPI is the average CPI for the program, and C is the CPU clock cycle.
(This equation is commonly known as the CPU performance equation)

Improving the performance
CPU time = Seconds/Program = (Instructions/Program) x (Cycles/Instruction) x (Seconds/Cycle)
• Increase the clock frequency = reduce the clock period
• Reduce the number of cycles for the program
• Reduce the number of instructions
Instruction = cycle?
• Is the number of cycles identical to the number of instructions?
– No!
• The number of cycles depends on the implementation of the operations in hardware
– The number differs for each processor
– Why?
• Operations take different amounts of time
– Multiplication takes longer than addition
– Floating point operations take longer than integer operations
• The access time to a register is much shorter than to a memory location
Aspects of CPU Execution Time
CPU Time = Instruction count x CPI x Clock cycle
T = I x CPI x C
• Instruction count I (executed) depends on: program used, compiler, ISA
• CPI (average CPI) depends on: program used, compiler, ISA, CPU organization
• Clock cycle C depends on: CPU organization, technology (VLSI)
Factors Affecting CPU Performance
CPU time = Seconds/Program = (Instructions/Program) x (Cycles/Instruction) x (Seconds/Cycle)

                                      Instruction Count I   CPI   Clock Cycle C
Program                               X                     X
Compiler                              X                     X
Instruction Set Architecture (ISA)    X                     X
Organization (CPU Design)                                   X     X
Technology (VLSI)                                                 X
CPU Execution Time: Example
• A program is running on a specific machine (CPU) with the following parameters:
– Total executed instruction count: 10,000,000 instructions
– Average CPI for the program: 2.5 cycles/instruction
– CPU clock rate: 200 MHz (clock cycle = 5x10^-9 seconds)
• What is the execution time for this program?
CPU time = Seconds/Program = (Instructions/Program) x (Cycles/Instruction) x (Seconds/Cycle)
CPU time = Instruction count x CPI x Clock cycle (T = I x CPI x C)
= 10,000,000 x 2.5 x 1 / clock rate
= 10,000,000 x 2.5 x 5x10^-9
= 0.125 seconds
Performance Comparison: Example
• From the previous example: A program is running on a specific machine (CPU) with the following parameters:
– Total executed instruction count, I: 10,000,000 instructions
– Average CPI for the program: 2.5 cycles/instruction
– CPU clock rate: 200 MHz
• Using the same program with these changes:
– A new compiler used: New executed instruction count, I: 9,500,000; New CPI: 3.0
– Faster CPU implementation: New clock rate = 300 MHz
• What is the speedup with the changes?
Speedup = Old Execution Time / New Execution Time = (Iold x CPIold x Clock cycleold) / (Inew x CPInew x Clock cyclenew)
Speedup = (10,000,000 x 2.5 x 5x10^-9) / (9,500,000 x 3 x 3.33x10^-9)
= 0.125 / 0.095 = 1.32
or 32% faster after the changes.
(Clock cycle = 1 / Clock rate, T = I x CPI x C)
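
The comparison can be checked with a short script. This is a minimal sketch of the CPU performance equation T = I x CPI x C using the numbers from the example; the function name `cpu_time` is a hypothetical helper, not something defined in the slides.

```python
def cpu_time(instruction_count: float, cpi: float, clock_rate_hz: float) -> float:
    """T = I x CPI x C, where the clock cycle C = 1 / clock rate."""
    return instruction_count * cpi / clock_rate_hz

# Original program and machine: 10,000,000 instructions, CPI 2.5, 200 MHz
old = cpu_time(10_000_000, 2.5, 200e6)
# New compiler (fewer instructions, higher CPI) and a faster 300 MHz CPU
new = cpu_time(9_500_000, 3.0, 300e6)

speedup = old / new
print(f"old = {old:.3f} s, new = {new:.3f} s, speedup = {speedup:.2f}")
# prints: old = 0.125 s, new = 0.095 s, speedup = 1.32
```

Note that the new compiler alone would make the program slower (9.5M x 3.0 cycles is more than 10M x 2.5 cycles); the overall gain comes from the faster clock.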

Instruction Types & CPI
• Given a program with n types or classes of instructions executed on a given CPU with the following characteristics:
C_i = Count of instructions of type i executed, i = 1, 2, …, n
CPI_i = Cycles per instruction for type i
Then:
CPI = CPU clock cycles / Instruction count I
Where:
$$\text{CPU clock cycles} = \sum_{i=1}^{n} (CPI_i \times C_i)$$
$$\text{Executed instruction count } I = \sum_{i=1}^{n} C_i$$
Instruction Types & CPI: An Example
• An instruction set has three instruction classes (for a specific CPU design):
Instruction class   CPI
A                   1
B                   2
C                   3
• Two code sequences have the following instruction counts:
                 Instruction counts for instruction class
Code sequence    A    B    C
1                2    1    2
2                4    1    1
• CPU cycles for sequence 1 = 2 x 1 + 1 x 2 + 2 x 3 = 10 cycles
CPI for sequence 1 = clock cycles / instruction count = 10 / 5 = 2
• CPU cycles for sequence 2 = 4 x 1 + 1 x 2 + 1 x 3 = 9 cycles
CPI for sequence 2 = 9 / 6 = 1.5
(using $$\text{CPU clock cycles} = \sum_{i=1}^{n} (CPI_i \times C_i)$$ and CPI = CPU cycles / I)
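
The two sequences can be evaluated mechanically. The following is a minimal sketch of the per-class cycle sum; `CLASS_CPI` and `cycles_and_cpi` are hypothetical names chosen for illustration.

```python
# CPI of each instruction class for this specific CPU design
CLASS_CPI = {"A": 1, "B": 2, "C": 3}

def cycles_and_cpi(counts, class_cpi=CLASS_CPI):
    """Return (CPU clock cycles, average CPI) for a dict of per-class instruction counts."""
    cycles = sum(class_cpi[cls] * n for cls, n in counts.items())  # sum of CPI_i x C_i
    instructions = sum(counts.values())                            # I = sum of C_i
    return cycles, cycles / instructions

sequences = {
    "sequence 1": {"A": 2, "B": 1, "C": 2},
    "sequence 2": {"A": 4, "B": 1, "C": 1},
}
for name, counts in sequences.items():
    cycles, cpi = cycles_and_cpi(counts)
    print(f"{name}: {cycles} cycles, CPI = {cpi}")
# prints: sequence 1: 10 cycles, CPI = 2.0
# prints: sequence 2: 9 cycles, CPI = 1.5
```

Sequence 2 executes more instructions (6 versus 5) yet needs fewer cycles, which is why instruction count alone is not a performance measure.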
Instruction Frequency & CPI
• Given a program with n types or classes of instructions with the following characteristics:
C_i = Count of instructions of type i, i = 1, 2, …, n
CPI_i = Average cycles per instruction of type i
F_i = Frequency or fraction of instruction type i executed = C_i / total executed instruction count = C_i / I
Then:
$$CPI = \sum_{i=1}^{n} (CPI_i \times F_i)$$
Fraction of total execution time for instructions of type i = (CPI_i x F_i) / CPI
Instruction Type Frequency & CPI: A RISC Example
Program profile or executed instructions mix (a typical mix) for the base machine (Reg / Reg):

Op       Freq, F_i   CPI_i   CPI_i x F_i   % Time = (CPI_i x F_i) / CPI
ALU      50%         1       0.5           23% = 0.5/2.2
Load     20%         5       1.0           45% = 1/2.2
Store    10%         3       0.3           14% = 0.3/2.2
Branch   20%         2       0.4           18% = 0.4/2.2
                             Sum = 2.2

$$CPI = \sum_{i=1}^{n} (CPI_i \times F_i)$$
CPI = .5 x 1 + .2 x 5 + .1 x 3 + .2 x 2 = .5 + 1 + .3 + .4 = 2.2
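
The table can be reproduced from the instruction mix alone. This is a minimal sketch, assuming the mix is stored as (frequency, CPI) pairs; the names `MIX`, `average_cpi`, and `time_fractions` are hypothetical.

```python
# Instruction mix of the base machine: op -> (frequency F_i, CPI_i)
MIX = {
    "ALU":    (0.50, 1),
    "Load":   (0.20, 5),
    "Store":  (0.10, 3),
    "Branch": (0.20, 2),
}

def average_cpi(mix):
    """CPI = sum over all types of CPI_i x F_i."""
    return sum(freq * cpi for freq, cpi in mix.values())

def time_fractions(mix):
    """Fraction of execution time per type = (CPI_i x F_i) / CPI."""
    total = average_cpi(mix)
    return {op: freq * cpi / total for op, (freq, cpi) in mix.items()}

print(f"CPI = {average_cpi(MIX):.1f}")
for op, fraction in time_fractions(MIX).items():
    print(f"{op}: {fraction:.0%} of execution time")
# prints: CPI = 2.2
# prints: ALU: 23% of execution time
# prints: Load: 45% of execution time
# prints: Store: 14% of execution time
# prints: Branch: 18% of execution time
```

Loads are only 20% of the instructions but account for 45% of the time, which makes them the natural target for an enhancement.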
Computer Performance Measures: MIPS (Million Instructions Per Second) Rating
• For a specific program running on a specific CPU the MIPS rating is a measure of how many millions of instructions are executed per second:
MIPS Rating = Instruction count / (Execution time x 10^6)
= Instruction count / (CPU clocks x Cycle time x 10^6)
= (Instruction count x Clock rate) / (Instruction count x CPI x 10^6)
= Clock rate / (CPI x 10^6)
• Major problem with MIPS rating: As shown above, the MIPS rating does not account for the count of instructions executed (I).
– A higher MIPS rating in many cases may not mean higher performance or better execution time, e.g., due to compiler design variations.
• In addition, the MIPS rating:
– Does not account for the instruction set architecture (ISA) used.
• Thus it cannot be used to compare computers/CPUs with different instruction sets.
– Is easy to abuse: the program used to get the MIPS rating is often omitted.
• Often the peak MIPS rating is provided for a given CPU. It is obtained using a program comprised entirely of instructions with the lowest CPI for the given CPU design, which does not represent real programs.
Computer Performance Measures: MIPS (Million Instructions Per Second) Rating
• Under what conditions can the MIPS rating be used to compare performance of different CPUs?
• The MIPS rating is only valid to compare the performance of different CPUs provided that the following conditions are satisfied:
1 The same program is used (actually this applies to all performance metrics)
2 The same ISA is used
3 The same compiler is used
→ Thus the resulting programs used to run on the CPUs and obtain the MIPS rating are identical at the machine code level, including the same instruction count.
Wrong!!!
• 3 significant problems with using MIPS:
– Problem 1:
• MIPS is instruction set dependent.
• (And different computer brands usually have different instruction
sets)

– Problem 2:
• MIPS varies between programs on the same computer

– Problem 3:
• MIPS can vary inversely to performance!
• Let’s look at an example of why MIPS doesn’t work…

Compiler Variations, MIPS & Performance: An Example
• For a machine (CPU) with instruction classes:
Instruction class   CPI
A                   1
B                   2
C                   3
• For a given high-level language program, two compilers produced the following executed instruction counts:
               Instruction counts (in millions) for each instruction class
Code from:     A     B     C
Compiler 1     5     1     1
Compiler 2     10    1     1
• The machine is assumed to run at a clock rate of 100 MHz.
Compiler Variations, MIPS & Performance: An Example (Continued)
MIPS = Clock rate / (CPI x 10^6) = (100 x 10^6) / (CPI x 10^6)
CPI = CPU execution cycles / Instruction count
$$\text{CPU clock cycles} = \sum_{i=1}^{n} (CPI_i \times C_i)$$
CPU time = Instruction count x CPI / Clock rate

• For compiler 1:
– CPI1 = (5 x 1 + 1 x 2 + 1 x 3) / (5 + 1 + 1) = 10 / 7 = 1.43
– MIPS Rating1 = (100 x 10^6) / ((10/7) x 10^6) = 70.0 MIPS
– CPU time1 = ((5 + 1 + 1) x 10^6 x 1.43) / (100 x 10^6) = 0.10 seconds

• For compiler 2:
– CPI2 = (10 x 1 + 1 x 2 + 1 x 3) / (10 + 1 + 1) = 15 / 12 = 1.25
– MIPS Rating2 = (100 x 10^6) / (1.25 x 10^6) = 80.0 MIPS
– CPU time2 = ((10 + 1 + 1) x 10^6 x 1.25) / (100 x 10^6) = 0.15 seconds

The MIPS rating indicates that compiler 2 is better, while in reality the code produced by compiler 1 is faster.
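
The conflict between MIPS rating and execution time is easy to reproduce. This is a minimal sketch using the class CPIs, instruction counts, and 100 MHz clock from the example; `evaluate` and the constant names are hypothetical.

```python
CLASS_CPI = {"A": 1, "B": 2, "C": 3}   # cycles per instruction for each class
CLOCK_RATE_HZ = 100e6                  # 100 MHz

def evaluate(counts_millions):
    """Return (CPI, MIPS rating, CPU time in seconds) for per-class counts given in millions."""
    instructions = sum(counts_millions.values()) * 1e6
    cycles = sum(CLASS_CPI[cls] * n for cls, n in counts_millions.items()) * 1e6
    cpi = cycles / instructions
    mips = CLOCK_RATE_HZ / (cpi * 1e6)   # MIPS = clock rate / (CPI x 10^6)
    cpu_time = cycles / CLOCK_RATE_HZ    # same as I x CPI / clock rate
    return cpi, mips, cpu_time

for name, counts in (("compiler 1", {"A": 5, "B": 1, "C": 1}),
                     ("compiler 2", {"A": 10, "B": 1, "C": 1})):
    cpi, mips, cpu_time = evaluate(counts)
    print(f"{name}: CPI = {cpi:.2f}, MIPS = {mips:.1f}, CPU time = {cpu_time:.2f} s")
# prints: compiler 1: CPI = 1.43, MIPS = 70.0, CPU time = 0.10 s
# prints: compiler 2: CPI = 1.25, MIPS = 80.0, CPU time = 0.15 s
```

Compiler 2 earns the higher MIPS rating only because it executes five million extra cheap class-A instructions, which also makes its code 50% slower.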
MIPS (The ISA not the metric) Loop Performance Example
For the loop:
    for (i=0; i<1000; i=i+1){
        x[i] = x[i] + s; }
MIPS assembly code is given by:
          lw   $3, 8($1)       # load s in $3
          addi $6, $2, 4000    # $6 = address of last element + 4
    loop: lw   $4, 0($2)       # load x[i] in $4
          add  $5, $4, $3      # $5 has x[i] + s
          sw   $5, 0($2)       # store computed x[i]
          addi $2, $2, 4       # increment $2 to point to next x[ ] element
          bne  $6, $2, loop    # last loop iteration reached?
X[ ] is an array of words in memory with its base address in $2; s is a constant word value in memory with its address in $1.
[Figure: memory layout. $2 initially points at X[0], the first element to compute (low memory); X[999] is the last element to compute; $6 points just past X[999] (high memory).]
The MIPS code is executed on a specific CPU that runs at 500 MHz (clock cycle = 2 ns = 2x10^-9 seconds) with the following instruction type CPIs:
Instruction type   CPI
ALU                4
Load               5
Store              7
Branch             3
For this MIPS code running on this CPU find:
1- Fraction of total instructions executed for each instruction type
2- Total number of CPU cycles
3- Average CPI
4- Fraction of total execution time for each instruction type
5- Execution time
6- MIPS rating, peak MIPS rating for this CPU
MIPS (The ISA) Loop Performance Example (continued)
• The code has 2 instructions before the loop and 5 instructions in the body of the loop, which iterates 1000 times.
• Thus: Total instructions executed, I = 5 x 1000 + 2 = 5002 instructions
1 Number of instructions executed and fraction F_i for each instruction type:
– ALU instructions = 1 + 2 x 1000 = 2001, CPI_ALU = 4, F_ALU = 2001/5002 = 0.4 = 40%
– Load instructions = 1 + 1 x 1000 = 1001, CPI_Load = 5, F_Load = 1001/5002 = 0.2 = 20%
– Store instructions = 1000, CPI_Store = 7, F_Store = 1000/5002 = 0.2 = 20%
– Branch instructions = 1000, CPI_Branch = 3, F_Branch = 1000/5002 = 0.2 = 20%
2 CPU clock cycles: $$\text{CPU clock cycles} = \sum_{i=1}^{n} (CPI_i \times C_i)$$
= 2001 x 4 + 1001 x 5 + 1000 x 7 + 1000 x 3 = 23009 cycles
3 Average CPI = CPU clock cycles / I = 23009/5002 = 4.6
4 Fraction of execution time for each instruction type:
– Fraction of time for ALU instructions = CPI_ALU x F_ALU / CPI = 4 x 0.4/4.6 = 0.348 = 34.8%
– Fraction of time for load instructions = CPI_Load x F_Load / CPI = 5 x 0.2/4.6 = 0.217 = 21.7%
– Fraction of time for store instructions = CPI_Store x F_Store / CPI = 7 x 0.2/4.6 = 0.304 = 30.4%
– Fraction of time for branch instructions = CPI_Branch x F_Branch / CPI = 3 x 0.2/4.6 = 0.13 = 13%
5 Execution time = I x CPI x C = CPU cycles x C = 23009 x 2x10^-9 = 4.6x10^-5 seconds = 0.046 msec = 46 usec
6 MIPS rating = Clock rate / (CPI x 10^6) = (500 x 10^6) / (4.6 x 10^6) = 108.7 MIPS
– The CPU achieves its peak MIPS rating when executing a program that only has instructions of the type with the lowest CPI. In this case branches with CPI_Branch = 3
– Peak MIPS rating = Clock rate / (CPI_Branch x 10^6) = (500 x 10^6) / (3 x 10^6) = 166.67 MIPS
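
All six answers follow from the instruction counts, so they can be recomputed in a few lines. This is a minimal sketch; the per-type counts in `SETUP` and `BODY` are read off the assembly listing (two setup instructions, five per iteration), and all names are hypothetical.

```python
CLOCK_RATE_HZ = 500e6
CPI = {"ALU": 4, "Load": 5, "Store": 7, "Branch": 3}
ITERATIONS = 1000

# Instructions per type: before the loop (lw, addi) and per iteration (lw, add, sw, addi, bne)
SETUP = {"ALU": 1, "Load": 1, "Store": 0, "Branch": 0}
BODY = {"ALU": 2, "Load": 1, "Store": 1, "Branch": 1}

counts = {t: SETUP[t] + BODY[t] * ITERATIONS for t in CPI}
total = sum(counts.values())                          # I
cycles = sum(CPI[t] * counts[t] for t in CPI)         # sum of CPI_i x C_i
avg_cpi = cycles / total
time_share = {t: CPI[t] * counts[t] / cycles for t in CPI}
exec_time = cycles / CLOCK_RATE_HZ                    # cycles x clock cycle
mips = CLOCK_RATE_HZ / (avg_cpi * 1e6)
peak_mips = CLOCK_RATE_HZ / (min(CPI.values()) * 1e6)

print(f"I = {total}, cycles = {cycles}, average CPI = {avg_cpi:.1f}")
print(f"execution time = {exec_time * 1e6:.0f} usec")
print(f"MIPS = {mips:.1f}, peak MIPS = {peak_mips:.2f}")
# prints: I = 5002, cycles = 23009, average CPI = 4.6
# prints: execution time = 46 usec
# prints: MIPS = 108.7, peak MIPS = 166.67
```

The script uses the exact counts rather than the rounded 40/20/20/20 fractions, so its time shares can differ from the slide values in the third decimal place.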
Computer Performance Measures: MFLOPS
• A floating-point operation is an addition, subtraction, multiplication, or division operation applied to numbers represented by a single or a double precision floating-point representation.
• MFLOPS, for a specific program running on a specific computer, is a measure of millions of floating-point operations (megaflops) per second:
MFLOPS = Number of floating-point operations / (Execution time x 10^6)
• The MFLOPS rating is a better comparison measure between different machines than the MIPS rating.
– Applicable even if ISAs are different
Computer Performance Measures: MFLOPS
• Program-dependent: Different programs have different percentages of floating-point operations present, e.g., compilers have no floating-point operations and yield a MFLOPS rating of zero.
• Dependent on the type of floating-point operations present in the program.
– Peak MFLOPS rating for a CPU: Obtained using a program comprised entirely of the simplest floating-point instructions (with the lowest CPI) for the given CPU design, which does not represent real floating-point programs.
Quantitative Principles of Computer Design
• Amdahl’s Law:
– The performance gain from improving some portion of a computer is calculated by:
Speedup = Performance for entire task using the enhancement / Performance for entire task without using the enhancement
or
Speedup = Execution time for entire task without the enhancement / Execution time for entire task using the enhancement
Performance Enhancement Calculations: Amdahl's Law
• The performance enhancement possible due to a given design improvement is limited by the amount that the improved feature is used
• Amdahl’s Law:
– Performance improvement or speedup due to enhancement E:
Speedup(E) = Execution Time without E / Execution Time with E = Performance with E / Performance without E
– Suppose that enhancement E accelerates a fraction F of the execution time by a factor S and the remainder of the time is unaffected, then:
Execution Time with E = ((1 - F) + F/S) x Execution Time without E
Hence speedup is given by:
Speedup(E) = Execution Time without E / (((1 - F) + F/S) x Execution Time without E) = 1 / ((1 - F) + F/S)
Note: F (fraction of execution time enhanced) refers to the original execution time before the enhancement is applied.
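
Amdahl's Law translates directly into a one-line function. This is a minimal sketch; `amdahl_speedup` is a hypothetical helper name, and the sample inputs are illustrative values (10% of the time sped up 2x, and 45% of the time sped up 2.5x).

```python
def amdahl_speedup(fraction: float, factor: float) -> float:
    """Speedup(E) = 1 / ((1 - F) + F / S).

    fraction: F, share of the ORIGINAL execution time that is enhanced (0..1)
    factor:   S, speedup of that share
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    if factor <= 0:
        raise ValueError("factor must be positive")
    return 1.0 / ((1.0 - fraction) + fraction / factor)

print(f"{amdahl_speedup(0.10, 2):.3f}")     # prints: 1.053
print(f"{amdahl_speedup(0.45, 2.5):.2f}")   # prints: 1.37
# Even an enormous factor cannot beat 1 / (1 - F):
print(f"{amdahl_speedup(0.10, 1e12):.3f}")  # prints: 1.111
```

The last line illustrates the limit stated in the first bullet: the unaffected fraction (1 - F) bounds the achievable speedup no matter how large S becomes.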
Pictorial Depiction of Amdahl’s Law
Enhancement E accelerates fraction F of original execution time by a factor of S

Before: Execution time without enhancement E (before enhancement is applied), shown normalized to 1 = (1 - F) + F = 1
| Unaffected fraction: (1 - F) | Affected fraction: F |

After: Execution time with enhancement E
| Unaffected fraction: (1 - F), unchanged | F/S |

Speedup(E) = Execution Time without enhancement E / Execution Time with enhancement E = 1 / ((1 - F) + F/S)
Example of Amdahl’s Law
• Floating point instructions improved to run 2X, but only 10% of actual instructions are FP

ExTime_new = ExTime_old x (0.9 + 0.1/2) = 0.95 x ExTime_old

Speedup_overall = 1 / 0.95 = 1.053

Performance Enhancement Example
• For the RISC machine with the following instruction mix given earlier:
Op       Freq   Cycles   CPI(i)   % Time
ALU      50%    1        0.5      23%
Load     20%    5        1.0      45%
Store    10%    3        0.3      14%
Branch   20%    2        0.4      18%
CPI = 2.2
• If a CPU design enhancement improves the CPI of load instructions from 5 to 2, what is the resulting performance improvement from this enhancement?
Fraction enhanced = F = 45% or .45
Unaffected fraction = 1 - F = 100% - 45% = 55% or .55
Factor of enhancement = S = 5/2 = 2.5
Using Amdahl’s Law:
Speedup(E) = 1 / ((1 - F) + F/S) = 1 / (.55 + .45/2.5) = 1.37
An Alternative Solution Using CPU Equation
Op       Freq   Cycles   CPI(i)   % Time
ALU      50%    1        0.5      23%
Load     20%    5        1.0      45%
Store    10%    3        0.3      14%
Branch   20%    2        0.4      18%
CPI = 2.2
• If a CPU design enhancement improves the CPI of load instructions from 5 to 2, what is the resulting performance improvement from this enhancement?
Old CPI = 2.2
New CPI = .5 x 1 + .2 x 2 + .1 x 3 + .2 x 2 = 1.6
Speedup(E) = Original Execution Time / New Execution Time
= (Instruction count x old CPI x clock cycle) / (Instruction count x new CPI x clock cycle)
= old CPI / new CPI = 2.2 / 1.6 = 1.37
Which is the same speedup obtained from Amdahl’s Law in the first solution.
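
That the two solutions agree can be confirmed numerically. This is a minimal sketch, assuming the instruction mix is stored as (frequency, CPI) pairs; `MIX`, `average_cpi`, and the other names are hypothetical. It computes the speedup once from the ratio of CPIs and once from Amdahl's Law with F taken as the loads' share of the original execution time.

```python
MIX = {"ALU": (0.50, 1), "Load": (0.20, 5), "Store": (0.10, 3), "Branch": (0.20, 2)}

def average_cpi(mix):
    """CPI = sum of CPI_i x F_i over all instruction types."""
    return sum(freq * cpi for freq, cpi in mix.values())

# Solution 1: CPU equation (instruction count and clock cycle cancel out)
old_cpi = average_cpi(MIX)
enhanced = dict(MIX, Load=(0.20, 2))          # load CPI improved from 5 to 2
new_cpi = average_cpi(enhanced)
speedup_cpu_equation = old_cpi / new_cpi

# Solution 2: Amdahl's Law with F = fraction of ORIGINAL time spent in loads
load_freq, load_cpi = MIX["Load"]
F = load_freq * load_cpi / old_cpi            # 1.0 / 2.2, about 0.45
S = 5 / 2
speedup_amdahl = 1.0 / ((1.0 - F) + F / S)

print(f"CPU equation: {speedup_cpu_equation:.3f}, Amdahl: {speedup_amdahl:.3f}")
# prints: CPU equation: 1.375, Amdahl: 1.375
```

Both methods give 1.375, which the slides report as 1.37; the small difference in the Amdahl calculation on the slide comes from rounding F to .45.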
Performance Enhancement Example
• A program runs in 100 seconds on a machine with multiply operations responsible for 80 seconds of this time. By how much must the speed of multiplication be improved to make the program four times faster?
Desired speedup = 4 = 100 / Execution time with enhancement
→ Execution time with enhancement = 100/4 = 25 seconds
25 seconds = (100 - 80 seconds) + 80 seconds / S
25 seconds = 20 seconds + 80 seconds / S
→ 5 = 80 seconds / S
→ S = 80/5 = 16
Alternatively, it can also be solved by finding the enhanced fraction of execution time, F = 80/100 = .8, and then solving Amdahl’s speedup equation for the desired enhancement factor S:
Speedup(E) = 1 / ((1 - F) + F/S) = 4 = 1 / ((1 - .8) + .8/S) = 1 / (.2 + .8/S)
Solving for S gives S = 16.
Hence multiplication should be 16 times faster to get an overall speedup of 4.
Performance Enhancement Example
• For the previous example with a program running in 100 seconds on a machine with multiply operations responsible for 80 seconds of this time. By how much must the speed of multiplication be improved to make the program five times faster?
Desired speedup = 5 = 100 / Execution time with enhancement
→ Execution time with enhancement = 100/5 = 20 seconds
20 seconds = (100 - 80 seconds) + 80 seconds / S
20 seconds = 20 seconds + 80 seconds / S
→ 0 = 80 seconds / S
No amount of multiplication speed improvement can achieve this.
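
Solving Amdahl's Law for the required enhancement factor can be sketched as follows. The function `required_factor` is a hypothetical helper; it works in seconds, as the slides do, and returns `None` when the target cannot be reached by any factor.

```python
def required_factor(total_time, enhanced_time, target_speedup):
    """Factor S by which the enhanced part must speed up to reach target_speedup.

    Solves: total_time / target_speedup = (total_time - enhanced_time) + enhanced_time / S
    Returns None if no finite S can achieve the target.
    """
    target_time = total_time / target_speedup
    unaffected = total_time - enhanced_time
    budget = target_time - unaffected      # time left over for the enhanced part
    if budget <= 0:
        return None                        # the unaffected part alone already uses the target time
    return enhanced_time / budget

print(required_factor(100, 80, 4))   # prints: 16.0
print(required_factor(100, 80, 5))   # prints: None
```

With 20 of the 100 seconds unaffected, the best possible speedup is 100/20 = 5, and that limit is only approached as S grows without bound. For inputs that are not exactly representable in floating point, a small tolerance on the `budget <= 0` check would be a reasonable refinement.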
Extending Amdahl's Law To Multiple Enhancements
• Suppose that enhancement E_i accelerates a fraction F_i of the original execution time by a factor S_i and the remainder of the time is unaffected, then:
$$\text{Speedup} = \frac{\text{Original Execution Time}}{\left(\left(1 - \sum_i F_i\right) + \sum_i \frac{F_i}{S_i}\right) \times \text{Original Execution Time}}$$
where (1 - \sum_i F_i) is the unaffected fraction, so:
$$\text{Speedup} = \frac{1}{\left(1 - \sum_i F_i\right) + \sum_i \frac{F_i}{S_i}}$$
Note: All fractions F_i refer to the original execution time before the enhancements are applied.
Amdahl's Law With Multiple Enhancements: Example
• Three CPU performance enhancements are proposed with the following speedups and percentages of the code execution time affected:
Speedup1 = S1 = 10   Percentage1 = F1 = 20%
Speedup2 = S2 = 15   Percentage2 = F2 = 15%
Speedup3 = S3 = 30   Percentage3 = F3 = 10%
• While all three enhancements are in place in the new design, each enhancement affects a different portion of the code and only one enhancement can be used at a time.
• What is the resulting overall speedup?
$$\text{Speedup} = \frac{1}{\left(1 - \sum_i F_i\right) + \sum_i \frac{F_i}{S_i}}$$
• Speedup = 1 / [(1 - .2 - .15 - .1) + .2/10 + .15/15 + .1/30]
= 1 / [.55 + .0333]
= 1 / .5833 = 1.71
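
The multiple-enhancement form generalizes the single-enhancement function to a list of (F_i, S_i) pairs. This is a minimal sketch under the stated assumption that the enhancements affect disjoint portions of the original execution time; `multi_speedup` is a hypothetical name.

```python
def multi_speedup(enhancements):
    """Speedup = 1 / ((1 - sum F_i) + sum F_i / S_i).

    enhancements: iterable of (F_i, S_i) pairs, where each F_i is a fraction of the
    ORIGINAL execution time and the fractions do not overlap.
    """
    enhancements = list(enhancements)
    total_fraction = sum(f for f, _ in enhancements)
    if total_fraction > 1.0 + 1e-12:
        raise ValueError("fractions must not add up to more than 1")
    new_time = (1.0 - total_fraction) + sum(f / s for f, s in enhancements)
    return 1.0 / new_time

print(f"{multi_speedup([(0.20, 10), (0.15, 15), (0.10, 30)]):.2f}")  # prints: 1.71
```

Although the individual factors are 10, 15, and 30, the 55% of the time that is unaffected keeps the overall speedup below 1 / 0.55, or about 1.82.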
Pictorial Depiction of Example
Before: Execution time with no enhancements: 1
| Unaffected, fraction: .55 | F1 = .2 | F2 = .15 | F3 = .1 |
With S1 = 10, S2 = 15, S3 = 30, the enhanced fractions shrink to F1/10, F2/15, and F3/30; the unaffected fraction (.55) is unchanged.
After: Execution time with enhancements: .55 + .02 + .01 + .00333 = .5833
Speedup = 1 / .5833 = 1.71
Note: All fractions refer to original execution time.
Performance and Power
• Power is a key limitation
– Battery capacity has improved only slightly over time
• Need to design power-efficient processors
• Reduce power by
– Reducing frequency
– Reducing voltage
– Putting components to sleep
• Energy efficiency
– Important metric for power-limited applications
– Defined as performance divided by power consumption
Dynamic Energy and Power
• Dynamic energy
– Transistor switch from 0 -> 1 or 1 -> 0
– ½ x Capacitive load x Voltage²

• Dynamic power
– ½ x Capacitive load x Voltage² x Frequency switched

• Reducing clock rate reduces power, not energy

Power & Clock Rate
Performance and Power
[Figure: bar chart of relative performance (y-axis, 0.0 to 1.6) for the Pentium M @ 1.6/0.6 GHz, Pentium 4-M @ 2.4/1.2 GHz, and Pentium III-M @ 1.2/0.8 GHz on SPECINT 2000 and SPECFP 2000. X-axis: benchmark and power mode (always on / maximum clock, laptop mode / adaptive clock, minimum power / min clock).]

Energy Efficiency
[Figure: bar chart of relative energy efficiency for the Pentium M @ 1.6/0.6 GHz, Pentium 4-M @ 2.4/1.2 GHz, and Pentium III-M @ 1.2/0.8 GHz on SPECINT 2000 and SPECFP 2000. X-axis: benchmark and power mode (always on / maximum clock, laptop mode / adaptive clock, minimum power / min clock).]

Energy efficiency of the Pentium M is highest for the SPEC2000 benchmarks

Chip Manufacturing Process
[Figure: manufacturing flow. Silicon ingot -> Slicer -> Blank wafers (30 cm diameter, 1 mm thick) -> Hundreds of steps -> Patterned wafer -> Dicer -> Individual dies -> Die tester -> Tested dies -> Bond die to package -> Packaged dies -> Part tester -> Tested packaged dies -> Ship to customers]
Effect of Die Size on Yield
[Figure: two wafers with good and defective dies marked. Smaller dies: 120 dies, 109 good. Larger dies: 26 dies, 15 good.]

Dramatic decrease in yield with larger dies

Yield = (Number of Good Dies) / (Total Number of Dies)

\[
\text{Yield} = \frac{1}{\left(1 + \text{Defects per area} \times \text{Die area} / 2\right)^2}
\]

Die Cost = (Wafer Cost) / (Dies per Wafer × Yield)
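
The yield and cost formulas can be tried out with a short sketch. The die counts (120/109 and 26/15) come from the figure; the function names, the wafer cost of 1000, and the defect density passed to `model_yield` are hypothetical values used only for illustration.

```python
def measured_yield(good_dies, total_dies):
    """Yield = good dies / total dies."""
    return good_dies / total_dies


def model_yield(defects_per_area, die_area):
    """Yield = 1 / (1 + defects per area * die area / 2)^2."""
    return 1 / (1 + defects_per_area * die_area / 2) ** 2


def die_cost(wafer_cost, dies_per_wafer, die_yield):
    """Die cost = wafer cost / (dies per wafer * yield)."""
    return wafer_cost / (dies_per_wafer * die_yield)


small_yield = measured_yield(109, 120)
large_yield = measured_yield(15, 26)
print(f"{small_yield:.3f} {large_yield:.3f}")  # 0.908 0.577

WAFER_COST = 1000.0  # hypothetical cost per wafer
small_cost = die_cost(WAFER_COST, 120, small_yield)
large_cost = die_cost(WAFER_COST, 26, large_yield)
print(f"{small_cost:.2f} {large_cost:.2f}")  # 9.17 66.67

# Hypothetical defect density of 0.5 per unit area and a die area of 1 unit.
print(f"{model_yield(0.5, 1.0):.2f}")  # 0.64
```

Because dies per wafer times yield is simply the number of good dies, the cost per good die rises sharply when a larger die reduces both the die count and the yield.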

Integrated Circuit Cost
• Integrated circuit

Yield = (Number of Good Dies) / (Total Number of Dies)

\[
\text{Yield} = \frac{1}{\left(1 + \text{Defects per area} \times \text{Die area} / 2\right)^2}
\]
Things to Remember
• Performance is specific to a particular program
– Any measure of performance should reflect execution time
– Total execution time is a consistent summary of
performance
• For a given ISA, performance improvements come
from
– Increases in clock rate (without increasing the CPI)
– Improvements in processor organization that lower CPI
– Compiler enhancements that lower CPI and/or instruction
count
– Algorithm/Language choices that affect instruction count
• Pitfalls (things you should avoid)
– Using a subset of the performance equation as a metric
– Expecting improvement of one aspect of a computer to
increase performance proportional to the size of
improvement
Example
You are going to enhance a machine and there are
two types of possible improvements: either
• make multiply instructions run 4 times faster, or
• make memory access instructions run two times
faster than before.
You repeatedly run a program that takes 100
seconds to execute (on the original machine) and
find that of this time 25% is used for multiplication,
50% for memory access instructions, and 25% for
other tasks.

Example
1. What will the speedup be if you improve both
multiplication and memory access?
2. Assume the program you run has 10 billion
instructions and runs on a machine that has a
clock rate of 1 GHz. Calculate the CPI for this
machine. Assume further that the CPI for
multiplication instructions is 20 cycles and the CPI
for memory access instructions is 6 cycles. Compute
the CPI for all other instructions.
3. What is the CPI for the improved machine when
improvements on both multiplication and memory
access instructions are made?
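
One possible way to work through the three questions is sketched below. It is not a solution given in the slides: the variable names are hypothetical, and the sketch assumes the clock rate and instruction count stay the same on the improved machine and that the 25% / 50% / 25% split refers to execution time.

```python
TOTAL_TIME = 100.0        # seconds on the original machine
CLOCK_RATE = 1e9          # 1 GHz
INSTRUCTIONS = 10e9       # 10 billion instructions

# Time per category on the original machine (25% / 50% / 25%).
t_mul, t_mem, t_other = 25.0, 50.0, 25.0

# 1. Speedup with both improvements (multiply 4x faster, memory 2x faster).
new_time = t_mul / 4 + t_mem / 2 + t_other
speedup = TOTAL_TIME / new_time

# 2. Overall CPI, then the CPI of the "other" instructions.
cpi_overall = TOTAL_TIME * CLOCK_RATE / INSTRUCTIONS
n_mul = t_mul * CLOCK_RATE / 20    # multiply instructions at 20 cycles each
n_mem = t_mem * CLOCK_RATE / 6     # memory instructions at 6 cycles each
n_other = INSTRUCTIONS - n_mul - n_mem
cpi_other = t_other * CLOCK_RATE / n_other

# 3. CPI of the improved machine (same clock rate and instruction count).
cpi_improved = new_time * CLOCK_RATE / INSTRUCTIONS

print(f"speedup      = {speedup:.3f}")       # 1.778
print(f"CPI overall  = {cpi_overall:.1f}")   # 10.0
print(f"CPI other    = {cpi_other:.1f}")     # 60.0
print(f"CPI improved = {cpi_improved:.3f}")  # 5.625
```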
