Designing for
Performance
Raul Queiroz Feitosa
Objective
In this chapter we examine the most common
approach to assessing processor and
computer system performance.
Designing for Performance 2
Outline
◼ Performance Assessment
◼ Amdahl’s Law
◼ Benchmarks
◼ Homeworks
Which one would you choose?
            Intel Xeon Platinum 8458P   AMD Ryzen Threadripper PRO 5975WX
Cache       39 MB                       128 MB
Frequency   2.2 GHz                     3.6 GHz
Cores       26                          32
What matters?
❑ Cost
❑ Size
❑ Reliability
❑ Security
❑ Power Consumption
❑ Performance
❑…
Main CPU operations
❑ Fetch instructions
❑ Decode instructions
❑ Load and Store data
❑ Logic and Arithmetic Operations
    ❑ Fixed-Point
    ❑ Floating-Point
Performance Factors
Clock frequency ( f ) – expressed in multiples of Hz
• Clock cycle - one increment, or pulse, of the clock.
• Clock period ( τ ) - the time between consecutive pulses.
• Duty cycle – the ratio of time a signal is high compared to the total time.
[Figure: a clock generator feeds the clock signal to the CPU; the clock period τ spans one clock cycle of the actual clock signal.]
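The clock period is the reciprocal of the clock frequency, τ = 1/f. A minimal Python sketch of that relation follows; the helper name `clock_period_ns` is hypothetical, and the two frequencies are the ones quoted on the "Which one would you choose?" slide.

```python
def clock_period_ns(freq_hz: float) -> float:
    """Return the clock period tau = 1/f, expressed in nanoseconds."""
    if freq_hz <= 0:
        raise ValueError("frequency must be positive")
    return 1e9 / freq_hz


# Frequencies taken from the processor comparison slide.
for label, freq_hz in (("2.2 GHz", 2.2e9), ("3.6 GHz", 3.6e9)):
    print(f"{label}: tau = {clock_period_ns(freq_hz):.3f} ns")
```

A higher frequency gives a shorter period, but the period alone says nothing about how many cycles each instruction needs.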
Performance Factors
Clock frequency
• Usually, multiple clock cycles are required per
instruction.
• The amount of work implied by one instruction varies
considerably.
• Pipelining gives simultaneous execution of instructions.
• So, the clock frequency is not the whole story!
Performance Factors
Instruction Execution Rate
• Expressed in millions of instructions per second (MIPS) or
millions of floating-point operations per second (MFLOPS).
• Heavily dependent on the instruction set, compiler
design, processor implementation, cache, and memory
hierarchy.
• So, Instruction Execution Rate is not the whole story!
Performance Factors
CPI – the average number of cycles per instruction
• CPIk - number of cycles per instruction of type k.
• Ik - number of machine instructions of type k executed by a
program.
• Ic - number of machine instructions executed by a program
\[ I_c = \sum_{k=1}^{n} I_k \qquad CPI = \frac{\sum_{k=1}^{n} CPI_k \times I_k}{I_c} \qquad f(\text{MHz}) = CPI \times MIPS \]
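The three relations above can be sketched in Python. The function names `average_cpi` and `mips_rate` and the sample instruction counts are hypothetical, chosen only to illustrate the formulas.

```python
def average_cpi(mix):
    """mix is a list of (CPI_k, I_k) pairs; returns (I_c, average CPI)."""
    i_c = sum(i_k for _, i_k in mix)
    total_cycles = sum(cpi_k * i_k for cpi_k, i_k in mix)
    return i_c, total_cycles / i_c


def mips_rate(freq_hz, cpi):
    """MIPS = f / (CPI * 10^6), which is the same as f(MHz) = CPI * MIPS."""
    return freq_hz / (cpi * 1e6)


# Hypothetical program: 3 million 1-cycle and 1 million 3-cycle instructions
# running on an assumed 500 MHz clock.
i_c, cpi = average_cpi([(1, 3_000_000), (3, 1_000_000)])
print(f"Ic = {i_c}, CPI = {cpi:.2f}, MIPS = {mips_rate(500e6, cpi):.1f}")
```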
Performance Factors
T – processor time needed to execute a program
\[ T = I_c \times CPI \times \tau \]
A refinement yields
\[ T = I_c \times [\, p + (m \times K) \,] \times \tau \]
where
p is the number of processor cycles to decode + execute the instruction
m is the number of memory references needed
K is the ratio between memory cycle time and processor cycle time.
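As a rough illustration of the refined formula, the Python sketch below evaluates T for one set of made-up parameters. The function name `execution_time` and every numeric value are assumptions, and m is taken here as memory references per instruction.

```python
def execution_time(i_c, p, m, k, tau):
    """T = I_c * (p + m * K) * tau, in seconds when tau is in seconds."""
    return i_c * (p + m * k) * tau


# Hypothetical values: 1e6 instructions, p = 2 cycles, m = 0.5 memory
# references per instruction, K = 4, and a 1 GHz clock (tau = 1 ns).
t = execution_time(1_000_000, 2, 0.5, 4, 1e-9)
print(f"T = {t * 1e3:.1f} ms")
```

With these values the memory term m × K contributes as many cycles as the processor term p, which shows how the memory hierarchy can dominate T.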
Review Question 1
System attributes affecting the performance factors
                              Ic   p    m    K    τ
Instruction set architecture  ✓    ✓    !
Compiler technology           ✓    ✓    ✓
Processor implementation           ✓              ✓
Cache and memory hierarchy                   ✓    ✓
• Ic is the total number of executed instructions
• p is the number of cycles for processor internal operations
• m is the number of memory references needed
• K is the ratio between memory cycle time and processor cycle time.
• τ is the clock period.
Review Question 2
Consider two codes produced by two compilers for the same source program. The instructions of
the machine that will execute these codes can be divided into classes A (CPI=1) and B (CPI=2).
The number of executed instructions for each class are:
Class   Compiler 1   Compiler 2   CPI
A       600M         400M         1
B       400M         400M         2
a) Compute the execution time for both codes assuming a clock rate = 1 GHz.
\[ T_1 = \frac{(600 \times 1 + 400 \times 2) \times 10^6}{10^9} = 1.4\ \text{s} \]
\[ T_2 = \frac{(400 \times 1 + 400 \times 2) \times 10^6}{10^9} = 1.2\ \text{s} \]
b) Which compiler produces the most efficient code and by which factor?
Compiler 2 produces the more efficient code: it runs 1.4/1.2 ≈ 1.17 times faster than the code from compiler 1.
c) Which code executes at the highest MIPS?
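\[ CPI_1 = \frac{(600 \times 1 + 400 \times 2) \times 10^6}{1000 \times 10^6} = 1.4\ \text{clocks/instruction} \]
\[ CPI_2 = \frac{(400 \times 1 + 400 \times 2) \times 10^6}{800 \times 10^6} = 1.5\ \text{clocks/instruction} \]
Therefore, dividing the instruction count (in millions) by the execution time,
\[ MIPS_1 = \frac{1000}{1.4} \approx 714 \quad \text{and} \quad MIPS_2 = \frac{800}{1.2} \approx 667 \]
Code 1 executes at the higher MIPS rate even though code 2 finishes sooner.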
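The numbers in Review Question 2 can be checked with a short Python sketch. The instruction counts, CPI values and clock rate come from the question; the names `analyze`, `codes` and `results` are illustrative only.

```python
CLOCK_HZ = 1e9                      # clock rate given in the question
CPI = {"A": 1, "B": 2}              # cycles per instruction for each class
codes = {
    "compiler 1": {"A": 600e6, "B": 400e6},
    "compiler 2": {"A": 400e6, "B": 400e6},
}


def analyze(counts):
    """Return execution time, average CPI and MIPS for one compiled code."""
    i_c = sum(counts.values())
    cycles = sum(CPI[cls] * n for cls, n in counts.items())
    t = cycles / CLOCK_HZ
    return {"T": t, "CPI": cycles / i_c, "MIPS": i_c / (t * 1e6)}


results = {name: analyze(counts) for name, counts in codes.items()}
for name, r in results.items():
    print(f"{name}: T = {r['T']:.2f} s, CPI = {r['CPI']:.2f}, MIPS = {r['MIPS']:.0f}")
```

The sketch makes the point of the exercise visible: the code with the higher MIPS rate is not the one with the shorter execution time.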
Outline
◼ Performance Assessment
◼ Amdahl’s Law
◼ Benchmarks
◼ Homeworks
Amdahl’s Law
potential speed-up of the program using multiple processors
T is the total execution time for the program on a single processor
Fraction (1-f) of code inherently serial
Fraction f of code parallelizable with no scheduling overhead
N is the number of processors that fully exploit parallel portions of code
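[Figure: on a single processor the run time T splits into a serial part T(1 − f) and a parallelizable part Tf; on N parallel processors the parallelizable part shrinks to Tf/N.]
\[ Speedup = \frac{\text{time to execute program on a single processor}}{\text{time to execute program on } N \text{ parallel processors}} = \frac{T(1-f) + Tf}{T(1-f) + \frac{Tf}{N}} = \frac{1}{(1-f) + \frac{f}{N}} \]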
Amdahl’s Law
potential speed-up of the program using multiple processors
The performance gain depends on how much of the code is parallelizable!
If f is small, adding processors has little effect.
As N → ∞, the speedup is bounded by 1/(1 − f).
Adding more processors therefore yields diminishing returns.
\[ Speedup = \frac{1}{(1-f) + \frac{f}{N}} \]
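The diminishing returns can be seen numerically. The Python sketch below evaluates Amdahl's formula for a growing number of processors; the function name `amdahl_speedup` and the value f = 0.9 are assumptions made for illustration.

```python
def amdahl_speedup(f, n):
    """Speedup = 1 / ((1 - f) + f / n) for parallel fraction f on n processors."""
    if not 0.0 <= f <= 1.0 or n < 1:
        raise ValueError("need 0 <= f <= 1 and n >= 1")
    return 1.0 / ((1.0 - f) + f / n)


f = 0.9  # hypothetical parallelizable fraction
for n in (1, 2, 4, 8, 16, 1024):
    print(f"N = {n:4d}: speedup = {amdahl_speedup(f, n):.2f}")
print(f"bound 1/(1-f) = {1 / (1 - f):.2f}")
```

Each doubling of N buys less than the previous one, and no value of N pushes the speedup past the bound.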
Amdahl’s Law
in practice
Parallel programs introduce an overhead due to coordination
and synchronization, not present in their sequential
counterparts.
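[Figure: on a single processor the run time T splits into T(1 − f) and Tf; on N parallel processors the run time consists of T(1 − f), Tf/N and an additional overhead segment o.]
So, the actual speed-up becomes
\[ Speedup = \frac{T(1-f) + Tf}{T(1-f) + \frac{Tf}{N} + T o} = \frac{1}{(1-f) + \frac{f}{N} + o} \]
where the overhead o is expressed as a fraction of T.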
Review Question 3
A program spends 60% of its execution time on floating-point operations, 90% of
which are executed in parallelizable loops. When the code is parallelized, coordination
and synchronization between parts make the part not involving floating-point
operations 10% longer.
a) Find the improvement in terms of execution time achieved by doubling the speed of
the floating-point unit.
\[ speedup = \frac{1}{\frac{0.6}{2} + 0.4} = 1.43 \]
b) Find the improvement in terms of execution time achieved by using two processors
having the same speed and structure as the original one
\[ speedup = \frac{1}{\frac{0.6 \times 0.9}{2} + 0.6 \times 0.1 + 1.1 \times 0.4} = 1.30 \]
c) What would be the improvement if both changes are implemented?
\[ speedup = \frac{1}{\frac{0.6 \times 0.9}{4} + \frac{0.6 \times 0.1}{2} + 1.1 \times 0.4} = 1.65 \]
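The three answers to Review Question 3 can be reproduced with the Python sketch below. The fractions come from the question; the function `speedup` and its parameters `fpu_factor` and `processors` are illustrative names, and the original run time is normalized to 1.

```python
FP = 0.6          # fraction of time spent on floating-point operations
PAR = 0.9         # fraction of the FP time inside parallelizable loops
OTHER = 1.0 - FP  # fraction of time not involving floating point
OVERHEAD = 1.1    # the non-FP part takes 10% longer once parallelized


def speedup(fpu_factor=1.0, processors=1):
    """Overall speedup relative to a normalized original run time of 1."""
    parallel_fp = FP * PAR / (fpu_factor * processors)
    serial_fp = FP * (1.0 - PAR) / fpu_factor
    other = OTHER * (OVERHEAD if processors > 1 else 1.0)
    return 1.0 / (parallel_fp + serial_fp + other)


print(f"a) faster FPU only:     {speedup(fpu_factor=2):.2f}")
print(f"b) two processors only: {speedup(processors=2):.2f}")
print(f"c) both changes:        {speedup(fpu_factor=2, processors=2):.2f}")
```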
Amdahl’s Law
Generalization for any design improvement
\[ Speedup = \frac{\text{Execution time before enhancement}}{\text{Execution time after enhancement}} \]
Suppose that the enhancement affects a fraction f of the
total runtime before enhancement, and that the speedup
brought by this enhancement to that fraction is SU_f. Thus
\[ Speedup = \frac{1}{(1-f) + \frac{f}{SU_f}} \]
Amdahl’s Law
Generalization for any design improvement
Example:
Suppose that a task spends 40% of its time on
floating-point operations. A new FPU has speedup
K. Then the overall speedup is
\[ Speedup = \frac{1}{(1-0.4) + \frac{0.4}{K}} \]
So, as K → ∞, the maximum speedup is 1/0.6 ≈ 1.67.
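To see how quickly the overall speedup saturates, the FPU example can be evaluated for several values of K. This is a Python sketch; the function name `overall_speedup` and the chosen K values are assumptions.

```python
def overall_speedup(f, su_f):
    """Speedup = 1 / ((1 - f) + f / SU_f) for an enhancement covering fraction f."""
    return 1.0 / ((1.0 - f) + f / su_f)


f = 0.4  # fraction of time spent on floating-point operations (from the example)
for k in (2, 10, 100, 1e6):
    print(f"K = {k:>9g}: speedup = {overall_speedup(f, k):.3f}")
print(f"limit as K grows: {1 / (1 - f):.2f}")
```

Even a very large K cannot lift the overall speedup beyond the limit, because the remaining 60% of the time is untouched by the new FPU.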
Outline
◼ Performance Assessment
◼ Amdahl’s Law
◼ Benchmarks
◼ Homeworks
Benchmarks
Motivation
A high-level language statement
A=B+C /* assume all quantities in main memory */
Compiled code on CISC
    add mem(B),mem(C),mem(A)
Compiled code on RISC
    load mem(B),reg(1);
    load mem(C),reg(2);
    add reg(1),reg(2),reg(3);
    store reg(3),mem(A);
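Assume that both machines take the same time to run the same
high-level code. The RISC machine then executes four instructions
in the time the CISC machine executes one.
So, if MIPS_CISC = 1, then MIPS_RISC = 4, although both do the same work.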
Benchmarks
Definition
Programs designed to test performance
Written in high-level language → portable
Representative of a particular application or system programming
area (scientific, commercial)
Easily measured and widely distributed
The best-known collection of benchmark suites is defined by the
Standard Performance Evaluation Corporation (SPEC)
The best-known of the SPEC suites is SPEC CPU2017:
◼ contains 43 benchmarks organized into four suites
◼ includes an optional metric for measuring energy
consumption
Standard Performance Evaluation Corporation
(SPEC)
Benchmarks
SPECspeed metric
SPEC benchmarks are not concerned with instruction execution
rates
Base runtime defined for each benchmark using a reference
machine
Speed metric is the ratio of reference time to system run time,
r_i = Tref_i / Tsut_i
◼ Tref_i: execution time of benchmark i on the reference machine
◼ Tsut_i: execution time of benchmark i on the test system
Benchmarks
Averaging SPEC metrics
For SPECspeed, the ratios of the individual benchmarks are averaged using the
geometric mean, which is reported as the overall metric.
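A minimal Python sketch of this averaging step follows. The timing values are hypothetical and are not real SPEC results; the helper names `spec_ratio` and `geometric_mean` are illustrative.

```python
import math


def spec_ratio(t_ref, t_sut):
    """Ratio of reference time to system-under-test time (higher is faster)."""
    return t_ref / t_sut


def geometric_mean(values):
    """n-th root of the product, computed via logarithms for numerical stability."""
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Hypothetical (Tref_i, Tsut_i) pairs in seconds; NOT real SPEC measurements.
times = [(1000.0, 250.0), (1800.0, 200.0), (600.0, 100.0)]
ratios = [spec_ratio(t_ref, t_sut) for t_ref, t_sut in times]
print("ratios:", ratios)
print("overall metric:", round(geometric_mean(ratios), 2))
```

Unlike the arithmetic mean, the geometric mean of ratios ranks two systems the same way regardless of which machine is chosen as the reference.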
Outline
◼ Performance Assessment
◼ Amdahl’s Law
◼ Benchmarks
◼ Homeworks
Homework 1
A program involves the execution of 2 million instructions on a 400 MHz
processor. CPI and the proportion of four instruction types are given below.
Compute the average CPI:
Instruction type             CPI   Instruction mix
Arithmetic and logic          1    60%
Load/store with cache hit     2    18%
Branch                        4    12%
Load/store with cache miss    8    10%
Answer:
The average CPI is CPI = (1 × 0.6) + (2 × 0.18) + (4 × 0.12) + (8 × 0.1) = 2.24
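The weighted sum can be checked in Python. The sketch also derives the MIPS rate and execution time from the stated 400 MHz clock and 2 million instructions, using f(MHz) = CPI × MIPS and T = Ic × CPI × τ. Those two extra figures go beyond what the homework asks, and the variable names are illustrative.

```python
# (instruction type, CPI, share of the instruction mix) from the homework table
mix = [
    ("Arithmetic and logic", 1, 0.60),
    ("Load/store with cache hit", 2, 0.18),
    ("Branch", 4, 0.12),
    ("Load/store with cache miss", 8, 0.10),
]

avg_cpi = sum(cpi * share for _, cpi, share in mix)
mips = 400e6 / (avg_cpi * 1e6)        # f / (CPI * 10^6)
exec_time = 2e6 * avg_cpi / 400e6     # Ic * CPI * tau, with tau = 1/f

print(f"CPI = {avg_cpi:.2f}, MIPS = {mips:.1f}, T = {exec_time * 1e3:.1f} ms")
```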
Homework 2
Consider two hardware implementations M1 and M2 of the same instruction set.
There are three instruction classes: F, I and N. The M1 clock rate is 600 MHz.
The clock cycle of M2 is 2 ns. The average CPI for each of these instruction classes
is given below.
Class   CPI of M1   CPI of M2   Comments
F       5.0         4.0         floating-point
I       2.0         3.8         integer
N       2.4         2.0         non-arithmetic
a) Compute the peak performance for M1 and M2 in MIPS.
b) If 50% of the instructions executed in a given program belong to class N and
the others are equally distributed between F and I, which is the faster
machine and by what factor?
Homework 2
c) A designer of M1 plans to change the design to improve performance.
Assuming the instruction mix in (b), which of the options below would be
most beneficial?
1. Use an FPU twice as fast (CPI = 2.5 for class F).
2. Add a second ALU to reduce the CPI for integer operations to 1.20.
3. Use faster logic that allows a clock rate of 750 MHz while keeping the same
CPI values.
d) The CPI values given above include cache misses, which occur 5 times per 100
executed instructions. Each cache miss incurs a 10-cycle penalty. A
fourth redesign option consists of using a larger instruction cache so as to
reduce the miss ratio from 5% to 3%. Compare this alternative with the
previous options.
e) Characterize application programs that execute faster on M1 than on
M2, i.e., discuss the instruction composition of such applications. Hint: Let
x, y and 1-x-y be the fractions of instructions belonging to classes F, I and N
respectively.
Homework 3
A processor is used for an application where 30%, 25% and 10% of the
processing time is spent on floating-point addition, multiplication and division,
respectively. For a new processor version, three alternatives are being considered, all
of them involving nearly the same design and implementation cost. Which one
should be selected?
a) Redesign the adder, making it twice as fast as the old one.
b) Redesign the multiplier, making it three times as fast as the old one.
c) Redesign the divider, making it ten times as fast as the old one.
Homework 4
T is the average processing time of a computer operating at frequency f.
Instructions are grouped in 3 types, as shown below.
Instruction type            CPI
Floating point arithmetic   10
Integer arithmetic           5
Non-arithmetic               2
Typically a program executes the same proportion of instructions from all three
groups/types. Compute the MIPS and the new execution time, if the FPU
becomes twice as fast.
Homework 5
Let f1 and f2 be the operating frequencies of processors P1 and P2, respectively.
Assume that two compilers generate different executable codes for the same
source program, which may be executed by P1 as well as by P2. The codes have
the characteristics given below:
Instruction type            CPI   Proportion     Proportion
                                  (compiler 1)   (compiler 2)
Floating point arithmetic   10    20%            30%
Integer arithmetic           5    30%            10%
Non-arithmetic               2    50%            60%
Compute the ratio f1/f2 for which the processing time in P1 executing code 1
equals the processing time of P2 executing code 2.
Homework 6
The code of an application can be separated into a sequential part (S) and a
parallelizable part (P). When the application runs on a single processor, twice as many
instructions of type P are executed as of type S. When the application runs on
multiple processors, the number of instructions of type S increases by 10%. Consider the
following two configurations:
A) Single-processor machine operating at frequency 2f.
B) Four-processor machine operating at frequency f.
a) Determine the limit ratio r between the CPI of instructions of type P and type S
(r = CPI_P / CPI_S) for which configuration A) is faster than configuration B).
b) Compute the upper limit for the speedup that can be achieved using multiple processors
without changing the operating frequency.
Designing for Performance
END