Designing for Performance

Raul Queiroz Feitosa


Objective

In this chapter we examine the most common approach to assessing processor and computer system performance.



Outline
◼ Performance Assessment
◼ Amdahl’s Law
◼ Benchmarks
◼ Homework



Which one would you choose?

Intel Xeon Platinum 8458P:          39 MB cache,  2.2 GHz, 26 cores
AMD Ryzen Threadripper PRO 5975WX:  128 MB cache, 3.6 GHz, 32 cores



What matters?
❑ Cost
❑ Size
❑ Reliability
❑ Security
❑ Power Consumption
❑ Performance
❑…



Main CPU operations
❑ Fetch instructions
❑ Decode instructions
❑ Load and Store data
❑ Logic and Arithmetic Operations
❑ Fixed-Point

❑ Floating-Point



Performance Factors
Clock frequency ( f ) – expressed in multiples of Hz
• Clock cycle - one increment, or pulse, of the clock.
• Clock period ( τ ) - the time between consecutive pulses.
• Duty cycle – the ratio of time a signal is high compared to the total time.

[Figure: a clock generator driving the CPU; the waveform marks the clock period τ and one clock cycle, shown next to the actual clock signal.]



Performance Factors
Clock frequency

• Usually, multiple clock cycles are required per instruction.
• The amount of work implied by one instruction varies considerably.
• Pipelining allows the overlapped (simultaneous) execution of instructions.
• So, the clock frequency is not the whole story!



Performance Factors
Instruction Execution Rate

• Expressed in millions of instructions per second (MIPS) or millions of floating-point operations per second (MFLOPS).
• Heavily dependent on the instruction set, compiler design, processor implementation, cache, and memory hierarchy.
• So, the instruction execution rate is not the whole story!



Performance Factors
CPI – the average number of cycles per instruction

• CPIk - number of cycles per instruction of type k.
• Ik - number of machine instructions of type k executed by a program.
• Ic - total number of machine instructions executed by a program.

Ic = Σk Ik   (summing over the n instruction types)

CPI = ( Σk CPIk × Ik ) / Ic

f(MHz) = CPI × MIPS, i.e., MIPS = f(MHz) / CPI
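
As a quick illustration, the sketch below (Python; the instruction mix, CPI values and clock frequency are assumptions made up for this example, not figures from the slides) evaluates the three formulas above.

# Hypothetical instruction mix: counts (in millions) and CPI per instruction type
counts = {"alu": 600, "load_store": 300, "branch": 100}   # Ik, in millions
cpi    = {"alu": 1,   "load_store": 2,   "branch": 3}     # CPIk

Ic  = sum(counts.values())                                 # total instruction count (millions)
CPI = sum(cpi[k] * counts[k] for k in counts) / Ic         # weighted-average CPI

f_mhz = 1000                                               # assumed 1 GHz clock
MIPS  = f_mhz / CPI                                        # MIPS = f(MHz) / CPI

print(f"CPI = {CPI:.2f}, MIPS = {MIPS:.0f}")               # CPI = 1.50, MIPS = 667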



Performance Factors
T – processor time needed to execute a program
T = Ic × CPI × τ

A refinement yields

T = Ic × [ p + (m × K) ] × τ
where
p is the number of processor cycles to decode + execute the instruction
m is the number of memory references needed
K is the ratio between memory cycle time and processor cycle time.
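
A minimal numeric sketch of the refined model (all parameter values below are illustrative assumptions, not figures from the slides):

# Refined execution-time model: T = Ic * (p + m*K) * tau
Ic  = 2_000_000      # instructions executed (assumed)
p   = 2              # processor cycles to decode and execute one instruction (assumed)
m   = 0.4            # memory references per instruction (assumed)
K   = 5              # memory cycle time / processor cycle time (assumed)
tau = 1 / 400e6      # clock period of an assumed 400 MHz clock

T = Ic * (p + m * K) * tau
print(f"T = {T * 1e3:.1f} ms")   # T = 20.0 ms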



Review Question 1
System attributes affecting the performance factors

                                Ic    p    m    K    τ
Instruction set architecture    ✓     ✓
Compiler technology             ✓     ✓    ✓
Processor implementation              ✓              ✓
Cache and memory hierarchy                      ✓    ✓

• Ic is the total number of executed instructions


• p is the number of cycles for processor internal operations
• m is the number of memory references needed
• K is the ratio between memory cycle time and processor cycle time.
• τ is the clock period.

Review Question 2
Consider two codes produced by two compilers for the same source program. The instructions of the machine that will execute these codes can be divided into classes A (CPI = 1) and B (CPI = 2). The numbers of executed instructions for each class are:

Class   compiler 1   compiler 2   comments
A       600M         400M         CPI = 1
B       400M         400M         CPI = 2
a) Compute the execution time for both codes assuming a clock rate = 1 GHz.
T1 = (600 × 1 + 400 × 2) × 10⁶ / 10⁹ = 1.4 s
T2 = (400 × 1 + 400 × 2) × 10⁶ / 10⁹ = 1.2 s

b) Which compiler produces the more efficient code, and by what factor?
Compiler 2 is 1.4 / 1.2 = 1.17 times more efficient (faster) than compiler 1.

c) Which code executes at the higher MIPS rate?

CPI1 = (600 × 1 + 400 × 2) × 10⁶ / (1000 × 10⁶) = 1.4 clocks/instruction
CPI2 = (400 × 1 + 400 × 2) × 10⁶ / (800 × 10⁶)  = 1.5 clocks/instruction

Therefore, MIPS1 = 1000 / 1.4 = 714 and MIPS2 = 800 / 1.2 = 667
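
The same numbers can be reproduced with a short verification script (Python; purely a check of the arithmetic above, not part of the original slides):

# Review Question 2, verified numerically
f = 1e9                                    # 1 GHz clock
mixes = {1: {"A": 600e6, "B": 400e6},      # compiler 1
         2: {"A": 400e6, "B": 400e6}}      # compiler 2
cpi = {"A": 1, "B": 2}

for c, mix in mixes.items():
    cycles = sum(cpi[k] * n for k, n in mix.items())
    Ic     = sum(mix.values())
    T      = cycles / f                    # execution time in seconds
    mips   = Ic / (T * 1e6)                # millions of instructions per second
    print(f"compiler {c}: T = {T:.1f} s, CPI = {cycles / Ic:.1f}, MIPS = {mips:.0f}")
# compiler 1: T = 1.4 s, CPI = 1.4, MIPS = 714
# compiler 2: T = 1.2 s, CPI = 1.5, MIPS = 667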



Outline
◼ Performance Assessment
◼ Amdahl’s Law
◼ Benchmarks
◼ Homework



Amdahl’s Law
potential speed-up of the program using multiple processors

❑ T is the total execution time for the program on a single processor
❑ Fraction (1 − f) of code is inherently serial
❑ Fraction f of code is parallelizable with no scheduling overhead
❑ N is the number of processors that fully exploit the parallel portions of code

[Diagram: on a single processor the execution time T splits into a serial part T(1 − f) and a parallelizable part Tf; on N parallel processors the parallelizable part shrinks to Tf/N.]

Speedup = time to execute the program on a single processor / time on N parallel processors
        = [ T(1 − f) + Tf ] / [ T(1 − f) + Tf/N ]
        = 1 / [ (1 − f) + f/N ]
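
A small helper function makes the formula concrete (a sketch; the 80% parallel fraction used below is an assumed example value):

def amdahl_speedup(f, N):
    """Speedup when a fraction f of the work is spread over N processors."""
    return 1.0 / ((1 - f) + f / N)

for N in (2, 4, 16, 1024):
    print(N, round(amdahl_speedup(0.8, N), 2))   # 1.67, 2.5, 4.0, 4.98
# As N grows, the speedup approaches the ceiling 1 / (1 - 0.8) = 5.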



Amdahl’s Law
potential speed-up of the program using multiple processors

❑ The performance gain is limited by the parallelizable portion of the code!
❑ If f is small, adding processors has little effect.
❑ As N → ∞, the speedup is bounded by 1/(1 − f).
❑ Adding more processors gives diminishing returns.

Speedup = 1 / [ (1 − f) + f/N ]



Amdahl’s Law
in practice
Parallel programs introduce an overhead due to coordination
and synchronization, not present in their sequential
counterparts.
[Diagram: as before, but the run on N parallel processors now contains an extra overhead term o in addition to T(1 − f) and Tf/N.]

So, the actual speed-up becomes

Speedup = [ T(1 − f) + Tf ] / [ T(1 − f) + Tf/N + o ] = 1 / [ (1 − f) + f/N + o/T ]
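
Extending the earlier sketch with the overhead term, here expressed as a fraction o/T of the sequential runtime (that normalization and the example values are assumptions):

def amdahl_with_overhead(f, N, o_over_T):
    """Speedup when the parallel run adds an overhead o, given as a fraction of T."""
    return 1.0 / ((1 - f) + f / N + o_over_T)

# Same assumed 80% parallel fraction as before, plus a 5% coordination overhead
print(round(amdahl_with_overhead(0.8, 4, 0.05), 2))   # 2.22 instead of 2.5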



Review Question 3
A program spends 60% of its execution time on floating-point operations; 90% of that time is spent in parallelizable loops. When the code is parallelized, coordination and synchronization make the part not involving floating-point operations 10% longer.

a) Find the improvement in execution time achieved by doubling the speed of the floating-point unit.

speedup = 1 / (0.6/2 + 0.4) = 1.43

b) Find the improvement in execution time achieved by using two processors having the same speed and structure as the original one.

speedup = 1 / (0.6 × 0.9 / 2 + 0.6 × 0.1 + 1.1 × 0.4) = 1.30

c) What would be the improvement if both changes were implemented?

speedup = 1 / (0.6 × 0.9 / 4 + 0.6 × 0.1 / 2 + 1.1 × 0.4) = 1.65
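
The three answers can be reproduced with a few lines of Python (verification only; the fractions come straight from the question):

fp, par, stretch = 0.60, 0.90, 1.10   # FP share, parallelizable share of FP, non-FP stretch

a = 1 / (fp / 2 + (1 - fp))                                         # faster FPU only
b = 1 / (fp * par / 2 + fp * (1 - par) + stretch * (1 - fp))        # two processors only
c = 1 / (fp * par / 4 + fp * (1 - par) / 2 + stretch * (1 - fp))    # both changes
print(f"{a:.2f} {b:.2f} {c:.2f}")                                   # 1.43 1.30 1.65
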
Amdahl’s Law
Generalization for any design improvement

Speedup = execution time before enhancement / execution time after enhancement

Suppose that the enhancement affects a fraction f of the total runtime before the enhancement, and that the speed-up brought by this enhancement is SUf. Thus

Speedup = 1 / [ (1 − f) + f / SUf ]



Amdahl’s Law
Generalization for any design improvement

Example:
Suppose that a task spends 40% of its time on floating-point operations, and a new FPU provides a speedup of K. Then the overall speedup is

Speedup = 1 / [ (1 − 0.4) + 0.4/K ]

So, the maximum speedup (as K → ∞) is 1 / 0.6 ≈ 1.67.
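
Numerically (a quick check with a few assumed values of K):

def overall_speedup(f, su_f):
    """Generalized Amdahl: a fraction f of the runtime is sped up by factor su_f."""
    return 1.0 / ((1 - f) + f / su_f)

for K in (2, 10, 1e6):
    print(K, round(overall_speedup(0.4, K), 2))   # 1.25, 1.56, 1.67
# Even an arbitrarily fast FPU cannot push the speedup beyond 1 / (1 - 0.4) ≈ 1.67.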



Outline
◼ Performance Assessment
◼ Amdahl’s Law
◼ Benchmarks
◼ Homework



Benchmarks
Motivation

A high-level language statement

A=B+C /* assume all quantities in main memory */

Compiled code on CISC:
  add   mem(B),mem(C),mem(A)

Compiled code on RISC:
  load  mem(B),reg(1);
  load  mem(C),reg(2);
  add   reg(1),reg(2),reg(3);
  store reg(3),mem(A);

Assume that both machines take the same time to run the same high-level code.
So, if the CISC machine runs at 1 MIPS, the RISC machine runs at 4 MIPS: it executes four instructions in the time the CISC machine executes one, yet the useful work done is identical.
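
In numbers (a tiny Python sketch; the statement time is an arbitrary assumption):

# Both machines finish the same A = B + C statement in the same (assumed) time
t = 1e-6                       # seconds per statement, assumed equal for both machines
mips_cisc = 1 / (t * 1e6)      # 1 instruction executed  -> 1 MIPS
mips_risc = 4 / (t * 1e6)      # 4 instructions executed -> 4 MIPS
print(mips_cisc, mips_risc)    # 1.0 4.0: a 4x MIPS rating for identical useful work
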
Benchmarks
Definition
❑ Programs designed to test performance
❑ Written in a high-level language → portable
❑ Represent a particular application or system programming area (scientific, commercial)
❑ Easily measured and widely distributed
❑ The best-known collection of benchmark suites is maintained by the Standard Performance Evaluation Corporation (SPEC)
❑ The best-known of the SPEC suites is CPU2017:
  ◼ contains 43 benchmarks organized into four suites
  ◼ includes an optional metric for measuring energy consumption
Standard Performance Evaluation Corporation (SPEC)



Benchmarks
SPECspeed metric

❑ SPEC benchmarks are not concerned with instruction execution rates.
❑ A base runtime is defined for each benchmark using a reference machine.
❑ The speed metric is the ratio of the reference time to the run time on the system under test (ri = Trefi / Tsuti):
  ◼ Trefi - execution time for benchmark i on the reference machine
  ◼ Tsuti - execution time for benchmark i on the system under test



Benchmarks
Averaging SPEC metrics

For the SPECspeed metric, the selected ratios are averaged using the geometric mean, which is reported as the overall metric.
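
A sketch of the SPECspeed-style calculation (Python; the benchmark names are real CPU2017 integer benchmarks, but the reference and measured times below are made up for illustration):

import math

# (reference time, system-under-test time) in seconds, per benchmark - assumed values
times = {"600.perlbench_s": (1775, 500),
         "602.gcc_s":       (3981, 900),
         "605.mcf_s":       (4721, 1100)}

ratios = [t_ref / t_sut for t_ref, t_sut in times.values()]
speed_metric = math.prod(ratios) ** (1 / len(ratios))   # geometric mean of the ratios
print(round(speed_metric, 2))                           # ~4.07 for these assumed times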



Outline
◼ Performance Assessment
◼ Amdahl’s Law
◼ Benchmarks
◼ Homework



Homework 1
A program involves the execution of 2 million instructions on a 400 MHz
processor. CPI and the proportion of four instruction types are given below.
Compute the average CPI:

instruction type              CPI   instruction mix
Arithmetic and logic            1   60%
Load/store with cache hit       2   18%
Branch                          4   12%
Load/store with cache miss      8   10%

Answer:
average CPI = (1 × 0.6) + (2 × 0.18) + (4 × 0.12) + (8 × 0.1) = 2.24



Homework 2
Consider two hardware implementations, M1 and M2, of the same instruction set. There are three instruction classes: F, I and N. The M1 clock rate is 600 MHz, and the clock cycle of M2 is 2 ns. The average CPI values for these three instruction classes are:

Class   CPI of M1   CPI of M2   Comments
F       5.0         4.0         floating-point
I       2.0         3.8         integer
N       2.4         2.0         non-arithmetic

a) Compute the peak performance of M1 and M2 in MIPS.
b) If 50% of the instructions executed in a given program belong to class N and the others are
   equally distributed between F and I, which is the faster machine, and by what factor?



Homework 2
c) A designer of M1 plans to change the design to improve performance. Assuming the instruction
   mix in (b), which of the options below is the most beneficial?
   1. Use an FPU twice as fast (CPI = 2.5 for class F).
   2. Add a second ALU to reduce the CPI for integer operations to 1.2.
   3. Use faster logic that allows a clock rate of 750 MHz while keeping the same CPI values.
d) The CPI values given above include cache misses, which occur 5 times per 100 executed
   instructions. Each cache miss implies a 10-cycle penalty. A fourth redesign option consists of
   using a larger instruction cache so as to reduce the miss ratio from 5% to 3%. Compare this
   alternative with the previous options.
e) Characterize application programs that execute faster on M1 than on M2, i.e., discuss the
   instruction composition of such applications. Hint: let x, y and 1 − x − y be the fractions of
   instructions belonging to classes F, I and N, respectively.



Homework 3
A processor is used for an application where 30%, 25% and 10% of the processing time is spent on floating-point addition, multiplication and division, respectively. For a new processor version, three alternatives are being considered, all of them involving nearly the same design and implementation cost. Which one should be selected?
a) Redesign the adder, making it twice as fast as the old one.
b) Redesign the multiplier, making it three times as fast as the old one.
c) Redesign the divider, making it ten times as fast as the old one.



Homework 4
T is the average processing time of a computer operating at frequency f. Instructions are grouped into 3 types, as shown below.

Instruction type            CPI
Floating-point arithmetic    10
Integer arithmetic            5
Non-arithmetic                2

Typically, a program executes the same proportion of instructions from all three types. Compute the MIPS rate and the new execution time if the FPU becomes twice as fast.



Homework 5
Let f1 and f2 be the operating frequencies of processors P1 and P2, respectively. Assume that two compilers generate different executable codes for the same source program, each of which may be executed by P1 as well as by P2. The codes have the characteristics given below:

Instruction type            CPI   Proportion (compiler 1)   Proportion (compiler 2)
Floating-point arithmetic    10   20%                       30%
Integer arithmetic            5   30%                       10%
Non-arithmetic                2   50%                       60%

Compute the ratio f1/f2 for which the processing time of P1 executing code 1 equals the processing time of P2 executing code 2.



Homework 6
The code of an application can be separated into a sequential part (S) and a parallelizable part (P). When the application runs on a single processor, the number of executed instructions of type P is twice the number of type S. When the application runs on multiple processors, the number of instructions of type S increases by 10%. Consider the following two configurations:

A) A single-processor machine operating at frequency 2f.
B) A four-processor machine operating at frequency f.

a) Determine the limiting ratio r between the CPI of instructions of type P and type S
   (r = CPIP / CPIS) for which configuration A is faster than configuration B.
b) Compute the upper limit of the speed-up that can be achieved using multiple processors
   without changing the operating frequency.



Designing for Performance

END
15-17, 24,28,31-25

