4 Performance
Aziz Qaroush
Review: Computer System Components
CPU Core
• 1 GHz - 3.8 GHz
• 4-way superscalar
• RISC or RISC-core (x86):
– Deep instruction pipelines
– Dynamic scheduling
– Multiple FP, integer FUs
– Dynamic branch prediction
– Hardware speculation
Caches (all non-blocking)
• L1: 16-128K, 1-2 way set associative (on chip), separate or unified
• L2: 256K-2M, 4-32 way set associative (on chip), unified
• L3: 2-16M, 8-32 way set associative (off or on chip), unified
Front Side Bus (FSB) examples
• Alpha, AMD K7: EV6, 200-400 MHz
• Intel PII, PIII: GTL+, 133 MHz
• Intel P4: 800 MHz
Memory (memory controller off or on-chip)
• SDRAM PC100/PC133: 100-133 MHz, 64-128 bits wide, 2-way interleaved, ~900 MBytes/sec (64-bit)
• Current standard, Double Data Rate (DDR) SDRAM PC3200: 200 MHz DDR, 64-128 bits wide, 4-way interleaved, ~3.2 GBytes/sec (one 64-bit channel), ~6.4 GBytes/sec (two 64-bit channels)
I/O Buses
• Example: PCI, 33-66 MHz, 32-64 bits wide, 133-528 MBytes/sec
• PCI-X: 133 MHz, 64 bit, 1024 MBytes/sec
I/O Devices (attached through controllers, adapters, and NICs)
• Disks, displays, keyboards, networks
Figure: the computer architect, surrounded by Applications, Interfaces (API, ISA, Link, I/O Chan), Technology, Machine Organization (IR, Regs), and Measurement & Evaluation.
The Architecture Process
Figure: new concepts are created, their cost and performance are estimated, and the ideas are sorted into good ideas, mediocre ideas, and bad ideas.
What is Performance?
• How can we make intelligent choices about computers?
Performance Measurement and Evaluation
• Many dimensions to computer performance
– CPU execution time
• by instruction or sequence
– floating point
– integer
– branch performance
– Cache bandwidth
– Main memory bandwidth
– I/O performance
• bandwidth
• seeks
• pixels or polygons per second
• Relative importance depends on applications
Evaluation Tools
• Benchmarks, traces, & mixes
– macrobenchmarks & suites
• application execution time
– microbenchmarks
• measure one aspect of performance
– traces
• replay recorded accesses
– cache, branch, register
– mixes
• example instruction mix: MOVE 39%, BR 20%, LOAD 20%, STORE 10%, ALU 11%
• Simulation at many levels
– ISA, cycle accurate, RTL, gate, circuit
• trade fidelity for simulation rate
• Area and delay estimation
• Analysis
– e.g., queuing theory
– Fundamental Laws
Metrics of Computer Performance
Benchmarks and Benchmarking
Choosing Programs To Evaluate Performance
Levels of programs or benchmarks that could be used to evaluate
performance:
– Actual Target Workload: Full applications that run on the target
machine.
– Real Full Program-based Benchmarks:
• Select a specific mix or suite of programs that are typical of the targeted
applications or workload (e.g., SPEC95, SPEC CPU2000).
– Small “Kernel” Benchmarks:
• Key computationally-intensive pieces extracted from real programs.
– Examples: Matrix factorization, FFT, tree search, etc.
• Best used to test specific aspects of the machine.
– Microbenchmarks:
• Small, specially written programs that isolate one specific aspect of
performance: integer processing, floating-point processing, local
memory, input/output, etc.
Types of Benchmarks
Actual Target Workload
• Pros: Representative.
• Cons: Very specific. Non-portable. Complex: difficult to run or measure.
Full Application Benchmarks
• Pros: Portable. Widely used. Measurements useful in reality.
• Cons: Less representative than actual workload.
Microbenchmarks
• Pros: Identify peak performance and potential bottlenecks.
• Cons: Peak performance results may be a long way from real application performance.
SPEC: Standard Performance Evaluation Corporation
(originally the System Performance Evaluation Cooperative)
The most popular, industry-standard set of CPU
benchmarks.
• SPECmarks, 1989:
– 10 programs yielding a single number (“SPECmarks”).
• SPEC92, 1992:
– SPECint92 (6 integer programs) and SPECfp92 (14 floating point
programs).
• SPEC95, 1995:
– SPECint95 (8 integer programs):
• go, m88ksim, gcc, compress, li, ijpeg, perl, vortex
– SPECfp95 (10 floating-point intensive programs):
• tomcatv, swim, su2cor, hydro2d, mgrid, applu, turb3d, apsi, fpppp, wave5
– Performance relative to a Sun SuperSPARC I (50 MHz) which is given a score
of SPECint95 = SPECfp95 = 1
• SPEC CPU2000, 1999:
– CINT2000 (12 integer programs). CFP2000 (14 floating-point intensive
programs)
– Performance relative to a Sun Ultra5_10 (300 MHz) which is given a score
of SPECint2000 = SPECfp2000 = 100
Application Performance: Intel Core i9-12900K
vs Ryzen 9 5950X and Ryzen 9 5900X
Power Consumption, Efficiency, and Cooling: Intel
Core i9-12900K vs Ryzen 9 5950X and Ryzen 9 5900X
Architectural Performance Laws and Rules
of Thumb
• Measurement and Evaluation
– Architecture is an iterative process:
• Searching the space of possible designs
• Make selections
• Evaluate the selections made
– Good measurement tools are required to accurately evaluate the
selection.
• Measurement Tools
– Benchmarks, Traces, Mixes
– Cost, delay, area, power estimation
– Simulation (many levels)
• ISA, RTL, Gate, Circuit
– Queuing Theory
– Rules of Thumb
– Fundamental Laws
Time as a Measure of Performance
• Response Time
– Time between start and completion of a task
– The less time a task takes to run, the more tasks can be executed per unit of time
Computer Performance Measures: Program
Execution Time
• A specific program compiled to run on a specific
machine (CPU) “A” has the following parameters:
– I: the total executed instruction count of the program.
– CPI: the average number of cycles per instruction (average CPI).
– C: the clock cycle of machine “A”.
• How can one measure the performance of this machine
(CPU) running this program?
– Intuitively the machine (or CPU) is said to be faster or has
better performance running this program if the total execution
time is shorter.
– Thus the inverse of the total measured program execution
time is a possible performance measure or metric:
PerformanceA = 1 / Execution TimeA
How to compare performance of different machines?
What factors affect performance? How to improve
performance?
Comparing Computer Performance Using
Execution Time
• To compare the performance of two machines (or CPUs) “A”, “B” running a
given specific program:
PerformanceA = 1 / Execution TimeA
PerformanceB = 1 / Execution TimeB
• Machine A is n times faster than machine B (or slower, if n < 1) means:
$$\text{Speedup} = n = \frac{\text{Performance}_A}{\text{Performance}_B} = \frac{\text{Execution Time}_B}{\text{Execution Time}_A}$$
• With T the execution time per program in seconds, I the number of
instructions executed, CPI the average CPI for the program, and C the CPU clock cycle:
T = I x CPI x C
(This equation is commonly known as the CPU performance equation)
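As a minimal sketch of the machine comparison defined above, the following Python computes the speedup n from two execution times. The function name `speedup` and the two timings (10 s and 15 s) are hypothetical values chosen for illustration; they are not measurements from these slides.

```python
def speedup(exec_time_b, exec_time_a):
    """Return n such that machine A is n times faster than machine B.

    n = Performance_A / Performance_B = ExecutionTime_B / ExecutionTime_A
    """
    return exec_time_b / exec_time_a


# Hypothetical execution times of the same program on two machines (seconds).
time_a = 10.0
time_b = 15.0

n = speedup(time_b, time_a)
print(f"Machine A is {n:.2f} times faster than machine B")
```

A result below 1 means machine A is actually slower than machine B for that program.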
CPU Average CPI/Execution Time
For a given program executed on a given machine (CPU):
Instruction = cycle?
• Is the number of cycles identical with the number of
instructions?
– No!
• The number of cycles depends on the implementation of the
operations in hardware
– The number differs for each processor
– Why?
• Operations take different time
– Multiplication takes longer than addition
– Floating point operations take longer than integer operations
• The access time to a register is much shorter than to a memory
location
Aspects of CPU Execution Time
CPU Time = Instruction count x CPI x Clock cycle
T = I x CPI x C
• Instruction Count I (executed) depends on:
– Program Used
– Compiler
– ISA
• CPI (Average CPI) depends on:
– Program Used
– Compiler
– ISA
– CPU Organization
• Clock Cycle C depends on:
– CPU Organization
– Technology (VLSI)
Factors Affecting CPU Performance
CPU time = Seconds / Program = (Instructions / Program) x (Cycles / Instruction) x (Seconds / Cycle)

Factor                               Instruction Count I   CPI   Clock Cycle C
Program                              X                     X
Compiler                             X                     X
Instruction Set Architecture (ISA)   X                     X
Organization (CPU Design)                                  X     X
Technology (VLSI)                                                X
CPU Execution Time: Example
• A Program is running on a specific machine (CPU) with the
following parameters:
– Total executed instruction count: 10,000,000 instructions
– Average CPI for the program: 2.5 cycles/instruction.
– CPU clock rate: 200 MHz (clock cycle = 5 x 10^-9 seconds)
• What is the execution time for this program:
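The question can be answered directly with the CPU performance equation T = I x CPI x C. The short Python sketch below plugs in the parameters stated in this example; the function name `cpu_time` is only an illustrative choice.

```python
def cpu_time(instruction_count, cpi, clock_cycle_s):
    """CPU performance equation: T = I x CPI x C, in seconds."""
    return instruction_count * cpi * clock_cycle_s


I = 10_000_000          # total executed instruction count
CPI = 2.5               # average cycles per instruction
clock_rate_hz = 200e6   # 200 MHz
C = 1 / clock_rate_hz   # clock cycle = 5 x 10^-9 seconds

T = cpu_time(I, CPI, C)
print(f"Execution time = {T:.3f} seconds")
```

With these inputs the execution time is 10,000,000 x 2.5 x 5 x 10^-9 = 0.125 seconds.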
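$$\text{CPU clock cycles} = \sum_{i=1}^{n} (CPI_i \times C_i)$$
where $CPI_i$ is the CPI of instruction type $i$ and $C_i$ is the number of executed instructions of that type.
CPI = CPU Cycles / I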
Instruction Frequency & CPI
• Given a program with n types or classes of instructions with
the following characteristics:
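$$CPI = \sum_{i=1}^{n} (CPI_i \times F_i)$$
where $CPI_i$ is the average CPI of instruction type $i$ and $F_i$ is its frequency (the fraction of executed instructions that are of type $i$).
Fraction of total execution time for instructions of type i = (CPIi x Fi) / CPI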
Instruction Type Frequency & CPI:
A RISC Example
Program profile or executed instruction mix (typical mix) on the base machine (Reg / Reg). Freq and CPIi are given:

Op       Freq, Fi   CPIi   CPIi x Fi   % Time
ALU      50%        1      .5          23% = .5/2.2
Load     20%        5      1.0         45% = 1/2.2
Store    10%        3      .3          14% = .3/2.2
Branch   20%        2      .4          18% = .4/2.2
                           Sum = 2.2

% Time = (CPIi x Fi) / CPI

$$CPI = \sum_{i=1}^{n} (CPI_i \times F_i)$$
CPI = .5 x 1 + .2 x 5 + .1 x 3 + .2 x 2
= .5 + 1 + .3 + .4 = 2.2
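The RISC instruction mix above can be checked with a few lines of Python. This is a sketch that uses only the frequencies and per-type CPIs from the table; the variable names (`mix`, `time_fraction`) are illustrative.

```python
# Instruction mix from the RISC example: op -> (frequency F_i, CPI_i)
mix = {
    "ALU":    (0.50, 1),
    "Load":   (0.20, 5),
    "Store":  (0.10, 3),
    "Branch": (0.20, 2),
}

# Average CPI = sum over instruction types of CPI_i x F_i
cpi = sum(freq * cpi_i for freq, cpi_i in mix.values())

# Fraction of execution time spent in each type = (CPI_i x F_i) / CPI
time_fraction = {op: freq * cpi_i / cpi for op, (freq, cpi_i) in mix.items()}

print(f"Average CPI = {cpi:.1f}")
for op, frac in time_fraction.items():
    print(f"{op:6s} {frac:.0%} of execution time")
```

Loads are only 20% of the executed instructions but account for roughly 45% of the execution time, which is why they are a natural target for enhancement.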
Computer Performance Measures:
MIPS (Million Instructions Per Second) Rating
• For a specific program running on a specific CPU the MIPS rating is a
measure of how many millions of instructions are executed per
second:
MIPS Rating = Instruction count / (Execution Time x 10^6)
= Instruction count / (CPU clocks x Cycle time x 10^6)
= (Instruction count x Clock rate) / (Instruction count x CPI x 10^6)
= Clock rate / (CPI x 10^6)
• Major problem with MIPS rating: As shown above the MIPS rating
does not account for the count of instructions executed (I).
– A higher MIPS rating may in many cases not mean higher performance or
better execution time, e.g., due to compiler design variations.
• In addition the MIPS rating:
– Does not account for the instruction set architecture (ISA) used.
• Thus it cannot be used to compare computers/CPUs with different instruction
sets.
– Easy to abuse: Program used to get the MIPS rating is often omitted.
• Often the Peak MIPS rating is provided for a given CPU. It is obtained
using a program composed entirely of instructions with the lowest CPI for the
given CPU design, which does not represent real programs.
• Under what conditions can the MIPS rating be used to
compare performance of different CPUs?
• The MIPS rating is only valid to compare the performance of
different CPUs provided that the following conditions are
satisfied:
1 The same program is used
(actually this applies to all performance metrics)
2 The same ISA is used
3 The same compiler is used
(Thus the resulting programs used to run on the CPUs
and obtain the MIPS rating are identical at the machine
code level including the same instruction count)
Wrong!!!
• 3 significant problems with using MIPS:
– Problem 1:
• MIPS is instruction set dependent.
• (And different computer brands usually have different instruction
sets)
– Problem 2:
• MIPS varies between programs on the same computer
– Problem 3:
• MIPS can vary inversely to performance!
• Let’s look at an example of why MIPS doesn’t work…
Compiler Variations, MIPS & Performance:
An Example
• For a machine (CPU) with instruction classes:
$$\text{CPU clock cycles} = \sum_{i=1}^{n} (CPI_i \times C_i)$$
CPU time = Instruction count x CPI / Clock rate
• For compiler 1:
– CPI1 = (5 x 1 + 1 x 2 + 1 x 3) / (5 + 1 + 1) = 10 / 7 = 1.43
– MIPS Rating1 = (100 x 10^6) / (1.428 x 10^6) = 70.0 MIPS
– CPU time1 = ((5 + 1 + 1) x 10^6 x 1.43) / (100 x 10^6) = 0.10 seconds
• For compiler 2:
– CPI2 = (10 x 1 + 1 x 2 + 1 x 3) / (10 + 1 + 1) = 15 / 12 = 1.25
– MIPS Rating2 = (100 x 10^6) / (1.25 x 10^6) = 80.0 MIPS
– CPU time2 = ((10 + 1 + 1) x 10^6 x 1.25) / (100 x 10^6) = 0.15 seconds
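The two compilers can be compared side by side with a short Python sketch. The inputs are assumptions read back from the arithmetic above, since the data table itself is not shown: three instruction classes with CPIs 1, 2, and 3, instruction counts of (5, 1, 1) and (10, 1, 1) million, and a 100 MHz clock. The function name `evaluate` is illustrative.

```python
CLOCK_RATE_HZ = 100e6   # assumed from the "100 x 10^6" term in the calculations
CLASS_CPI = (1, 2, 3)   # assumed CPI of each instruction class

# Executed instruction counts per class, in millions (assumed from the calculations)
counts_millions = {
    "compiler 1": (5, 1, 1),
    "compiler 2": (10, 1, 1),
}


def evaluate(counts_m):
    """Return (average CPI, MIPS rating, CPU time in seconds) for one compiler."""
    instructions = sum(counts_m) * 1e6
    cycles = sum(n * cpi for n, cpi in zip(counts_m, CLASS_CPI)) * 1e6
    cpi = cycles / instructions
    mips = CLOCK_RATE_HZ / (cpi * 1e6)
    cpu_time = cycles / CLOCK_RATE_HZ
    return cpi, mips, cpu_time


results = {name: evaluate(c) for name, c in counts_millions.items()}
for name, (cpi, mips, cpu_time) in results.items():
    print(f"{name}: CPI = {cpi:.2f}, MIPS = {mips:.1f}, CPU time = {cpu_time:.2f} s")
```

Compiler 2 earns the higher MIPS rating yet takes longer to run, which is exactly the case where MIPS varies inversely with performance.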
Instruction type   CPI
ALU                4
Load               5
Store              7
Branch             3

2 CPU clock cycles:
$$\text{CPU clock cycles} = \sum_{i=1}^{n} (CPI_i \times C_i)$$
= 2001x4 + 1001x5 + 1000x7 + 1000x3 = 23009 cycles
3 Average CPI = CPU clock cycles / I = 23009/5002 = 4.6
4 Fraction of execution time for each instruction type:
– Fraction of time for ALU instructions = CPIALU x FALU / CPI = 4x0.4/4.6 = 0.348 = 34.8%
– Fraction of time for load instructions = CPIload x Fload / CPI = 5x0.2/4.6 = 0.217 = 21.7%
– Fraction of time for store instructions = CPIstore x Fstore / CPI = 7x0.2/4.6 = 0.304 = 30.4%
– Fraction of time for branch instructions = CPIbranch x Fbranch / CPI = 3x0.2/4.6 = 0.13 = 13%
5 Execution time = I x CPI x C = CPU cycles x C = 23009 x 2x10^-9
= 4.6 x 10^-5 seconds = 0.046 msec = 46 usec
6 MIPS rating = Clock rate / (CPI x 10^6) = (500 x 10^6) / (4.6 x 10^6) = 108.7 MIPS
– The CPU achieves its peak MIPS rating when executing a program that only has instructions of the type with
the lowest CPI. In this case branches with CPIBranch = 3
– Peak MIPS rating = Clock rate / (CPIBranch x 10^6) = (500 x 10^6) / (3 x 10^6) = 166.67 MIPS
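The steps of this worked example can be reproduced in Python as a sanity check. The instruction counts (2001, 1001, 1000, 1000), the per-type CPIs, and the 500 MHz clock (2 x 10^-9 s cycle) are taken from the calculations above; the variable names are illustrative.

```python
# Executed instruction counts C_i and per-type CPI_i, as used in the calculation
counts = {"ALU": 2001, "Load": 1001, "Store": 1000, "Branch": 1000}
cpi_by_type = {"ALU": 4, "Load": 5, "Store": 7, "Branch": 3}
clock_rate_hz = 500e6  # 500 MHz, i.e. a clock cycle of 2 x 10^-9 seconds

total_instructions = sum(counts.values())
cycles = sum(counts[t] * cpi_by_type[t] for t in counts)

avg_cpi = cycles / total_instructions
exec_time_s = cycles / clock_rate_hz
mips = clock_rate_hz / (avg_cpi * 1e6)

# Peak MIPS assumes a program made only of the lowest-CPI instruction type
peak_mips = clock_rate_hz / (min(cpi_by_type.values()) * 1e6)

print(f"I = {total_instructions}, cycles = {cycles}, average CPI = {avg_cpi:.2f}")
print(f"Execution time = {exec_time_s * 1e6:.1f} usec")
print(f"MIPS = {mips:.1f}, peak MIPS = {peak_mips:.2f}")
```

The gap between the measured rating and the peak rating shows how far a peak figure can sit from what a realistic instruction mix achieves.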
Computer Performance Measures: MFLOPS
• A floating-point operation is an addition, subtraction,
multiplication, or division operation applied to numbers
represented by a single or a double precision floating-point
representation.
• MFLOPS, for a specific program running on a specific
computer, is a measure of millions of floating-point operations
(megaflops) per second:
MFLOPS =
Number of floating-point operations / (Execution time x 10^6)
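A minimal Python sketch of the MFLOPS definition follows. The operation count and execution time are hypothetical numbers picked for illustration, and the function name `mflops` is not from the slides.

```python
def mflops(fp_operations, exec_time_s):
    """MFLOPS = number of floating-point operations / (execution time x 10^6)."""
    return fp_operations / (exec_time_s * 1e6)


# Hypothetical program: 50 million floating-point operations in 0.25 seconds
fp_ops = 50_000_000
exec_time = 0.25

rating = mflops(fp_ops, exec_time)
print(f"{rating:.1f} MFLOPS")
```

A program with no floating-point operations would produce a rating of zero regardless of how fast it runs.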
• Program-dependent: Different programs have different
percentages of floating-point operations. For example,
compilers have no floating-point operations and yield a
MFLOPS rating of zero.
• Dependent on the type of floating-point operations
present in the program.
– Peak MFLOPS rating for a CPU: Obtained using a program
composed entirely of the simplest floating-point
instructions (with the lowest CPI) for the given CPU design,
which does not represent real floating-point programs.
Quantitative Principles of Computer Design
• Amdahl’s Law:
– The performance gain from improving some portion of
a computer is calculated by:
Performance Enhancement Calculations:
Amdahl's Law
• The performance enhancement possible due to a given design
improvement is limited by the amount that the improved feature is
used
• Amdahl’s Law:
– Performance improvement or speedup due to enhancement E:
$$\text{Speedup}(E) = \frac{\text{Execution Time without } E}{\text{Execution Time with } E} = \frac{\text{Performance with } E}{\text{Performance without } E}$$
– Suppose that enhancement E accelerates a fraction F of the execution
time by a factor S and the remainder of the time is unaffected, then:
Execution Time with E = ((1 - F) + F/S) x Execution Time without E
Hence speedup is given by:
$$\text{Speedup}(E) = \frac{\text{Execution Time without } E}{((1 - F) + F/S) \times \text{Execution Time without } E} = \frac{1}{(1 - F) + F/S}$$
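Amdahl's Law as derived above translates into a one-line function. In the Python sketch below, the function name `amdahl_speedup` and the example values (F = 0.4, several values of S) are hypothetical and only serve to show how the overall speedup saturates.

```python
def amdahl_speedup(f, s):
    """Overall speedup when a fraction f of execution time is sped up by factor s."""
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must be a fraction between 0 and 1")
    if s <= 0:
        raise ValueError("s must be positive")
    return 1.0 / ((1.0 - f) + f / s)


f = 0.4  # hypothetical: 40% of the original execution time is affected
for s in (2, 10, 100, 1_000_000):
    print(f"S = {s:>9}: overall speedup = {amdahl_speedup(f, s):.4f}")

# Even with an unboundedly large S, the speedup cannot exceed 1 / (1 - F)
limit = 1.0 / (1.0 - f)
print(f"Upper bound = {limit:.4f}")
```

The loop makes the limiting behaviour visible: however large S becomes, the unaffected fraction (1 - F) caps the overall gain.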
Before: Execution time without enhancement E (before the enhancement is applied),
shown normalized to 1 = (1 - F) + F = 1. The fraction (1 - F) is unchanged by the enhancement.
Speedup_overall = 1 / 0.95 = 1.053
Performance Enhancement Example
• For the RISC machine with the following instruction mix given
earlier:
Op Freq Cycles CPI(i) % Time
ALU 50% 1 .5 23% CPI = 2.2
Load 20% 5 1.0 45%
Store 10% 3 .3 14%
Branch 20% 2 .4 18%
• If a CPU design enhancement improves the CPI of load instructions
from 5 to 2, what is the resulting performance improvement from
this enhancement:
Fraction enhanced = F = 45% or .45
Unaffected fraction = 1- F = 100% - 45% = 55% or .55
Factor of enhancement = S = 5/2 = 2.5
Using Amdahl’s Law:
$$\text{Speedup}(E) = \frac{1}{(1 - F) + F/S} = \frac{1}{.55 + .45/2.5} = 1.37$$
An Alternative Solution Using CPU Equation
Op Freq Cycles CPI(i) % Time
ALU 50% 1 .5 23%
Load 20% 5 1.0 45% CPI = 2.2
Store 10% 3 .3 14%
Branch 20% 2 .4 18%
• If a CPU design enhancement improves the CPI of load instructions
from 5 to 2, what is the resulting performance improvement from
this enhancement:
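Old CPI = 2.2
New CPI = .5 x 1 + .2 x 2 + .1 x 3 + .2 x 2 = 1.6
Speedup = Old CPI / New CPI = 2.2 / 1.6 = 1.375
(The instruction count and clock cycle are unchanged, so the ratio of execution times equals the ratio of CPIs. This agrees with the Amdahl's Law result of 1.37; the small difference comes from rounding F to .45.)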
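The agreement between the two solution methods can be verified numerically. The Python sketch below uses the instruction mix from the table and the load CPI change from 5 to 2; it computes the speedup once from the CPU equation and once from Amdahl's Law with an unrounded F. Variable names are illustrative.

```python
freq = {"ALU": 0.50, "Load": 0.20, "Store": 0.10, "Branch": 0.20}
old_cycles = {"ALU": 1, "Load": 5, "Store": 3, "Branch": 2}
new_cycles = dict(old_cycles, Load=2)  # enhancement: load CPI improves from 5 to 2

old_cpi = sum(freq[op] * old_cycles[op] for op in freq)
new_cpi = sum(freq[op] * new_cycles[op] for op in freq)

# Method 1: CPU equation (I and C unchanged, so speedup = old CPI / new CPI)
speedup_cpu_eq = old_cpi / new_cpi

# Method 2: Amdahl's Law with the exact fraction of time spent in loads
f = freq["Load"] * old_cycles["Load"] / old_cpi
s = old_cycles["Load"] / new_cycles["Load"]
speedup_amdahl = 1.0 / ((1.0 - f) + f / s)

print(f"Old CPI = {old_cpi:.1f}, new CPI = {new_cpi:.1f}")
print(f"CPU equation: {speedup_cpu_eq:.3f}, Amdahl's Law: {speedup_amdahl:.3f}")
```

Both methods give the same value when F is not rounded, since they are two views of the same execution-time ratio.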
Desired speedup = 5 = 100 / Execution Time with enhancement
Extending Amdahl's Law To Multiple
Enhancements
• Suppose that enhancement Ei accelerates a fraction Fi of the
original execution time by a factor Si and the remainder of the time
is unaffected then:
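$$\text{Speedup} = \frac{1}{\left(1 - \sum_i F_i\right) + \sum_i \frac{F_i}{S_i}}$$
where $\left(1 - \sum_i F_i\right)$ is the unaffected fraction of the original execution time.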
• While all three enhancements are in place in the new design, each
enhancement affects a different portion of the code and only one
enhancement can be used at a time.
• What is the resulting overall speedup?
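$$\text{Speedup} = \frac{1}{\left(1 - \sum_i F_i\right) + \sum_i \frac{F_i}{S_i}}$$
After: Execution time with enhancements (normalized to the original time), where the unchanged fraction is .55 and the three enhanced fractions are divided by their factors of 10, 15, and 30:
.55 + .02 + .01 + .00333 = .5833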
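The multiple-enhancement form of Amdahl's Law can be sketched in Python as follows. The fractions 20%, 15%, and 10% are inferred from the terms .02, .01, and .00333 and the factors 10, 15, and 30 above, so treat them as assumptions; the function name `amdahl_multi` is illustrative.

```python
def amdahl_multi(enhancements):
    """Overall speedup for (fraction, factor) pairs that affect disjoint portions."""
    total_fraction = sum(f for f, _ in enhancements)
    if total_fraction > 1.0:
        raise ValueError("enhanced fractions must not sum to more than 1")
    enhanced_time = sum(f / s for f, s in enhancements)
    return 1.0 / ((1.0 - total_fraction) + enhanced_time)


# (F_i, S_i) pairs; the fractions are inferred from the worked numbers above
enhancements = [(0.20, 10), (0.15, 15), (0.10, 30)]

time_with = 1.0 / amdahl_multi(enhancements)
print(f"Execution time with enhancements = {time_with:.4f}")
print(f"Overall speedup = {amdahl_multi(enhancements):.2f}")
```

This only holds when each enhancement applies to a different portion of the execution time, as the example states.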
• Dynamic power
– ½ x Capacitive load x Voltage^2 x Frequency switched
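Because voltage enters the dynamic power formula squared, lowering voltage and frequency together has a cubic effect. The Python sketch below illustrates this with a hypothetical 15% reduction in both; the normalized baseline values of 1.0 and the function name `dynamic_power` are assumptions made for the example.

```python
def dynamic_power(capacitive_load, voltage, frequency):
    """Dynamic power = 1/2 x capacitive load x voltage^2 x frequency switched."""
    return 0.5 * capacitive_load * voltage ** 2 * frequency


# Normalized baseline, then a hypothetical 15% cut in both voltage and frequency
baseline = dynamic_power(1.0, 1.0, 1.0)
scaled = dynamic_power(1.0, 0.85, 0.85)

ratio = scaled / baseline
print(f"Scaled design uses {ratio:.1%} of the baseline dynamic power")
```

The ½ and the capacitive load cancel in the ratio, leaving 0.85^2 x 0.85, or about 61% of the original dynamic power.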
Figure: bar chart (vertical axis from 0.0 to 1.2) of SPECINT 2000 and SPECFP 2000 results in three modes: always on / maximum clock, laptop mode / adaptive clock, and minimum power / min clock.
Figure: die manufacturing (labels: patterned wafer, 30 cm diameter, 1 mm thick; dicer; individual dies; tester; tested dies; defective die).
Example
1. What will the speedup be if you improve both
multiplication and memory access?
2. Assume the program you run has 10 billion
instructions and runs on a machine that has a
clock rate of 1 GHz. Calculate the CPI for this
machine. Assume further that the CPI for
multiplication instructions is 20 cycles and the CPI
for memory access instructions is 6 cycles. Compute
the CPI for all other instructions.
3. What is the CPI for the improved machine when
improvements on both multiplication and memory
access instructions are made?