
4 Performance

The document discusses measuring and reporting computer system performance. It covers topics like CPU components, memory, caches, benchmarks, and performance metrics. Effective performance measurement requires considering multiple dimensions like execution time by instruction type, cache bandwidth, and I/O performance.

Uploaded by

Laith Qasem

Measuring & Reporting Performance

Aziz Qaroush
Review: Computer System Components
CPU Core
• 1 GHz - 3.8 GHz
• 4-way superscalar
• RISC or RISC-core (x86):
– Deep instruction pipelines
– Dynamic scheduling
– Multiple FP, integer FUs
– Dynamic branch prediction
– Hardware speculation
Caches (L1, L2, L3): all non-blocking
• L1: 16-128K, 1-2 way set associative (on chip), separate or unified
• L2: 256K-2M, 4-32 way set associative (on chip), unified
• L3: 2-16M, 8-32 way set associative (off or on chip), unified
Front Side Bus (FSB)
• Examples: Alpha, AMD K7: EV6, 200-400 MHz
• Intel PII, PIII: GTL+ 133 MHz
• Intel P4: 800 MHz
Memory Controller (off or on-chip) and Memory Bus
• SDRAM PC100/PC133: 100-133 MHz, 64-128 bits wide, 2-way interleaved, ~900 MBytes/sec (64-bit)
• Current standard: Double Data Rate (DDR) SDRAM PC3200: 200 MHz DDR, 64-128 bits wide, 4-way interleaved, ~3.2 GBytes/sec (one 64-bit channel), ~6.4 GBytes/sec (two 64-bit channels)
• RAMbus DRAM (RDRAM): 400 MHz DDR, 16 bits wide (32 banks), ~1.6 GBytes/sec
I/O Buses
• Example: PCI, 33-66 MHz, 32-64 bits wide, 133-528 MBytes/sec
• PCI-X: 133 MHz, 64 bit, 1024 MBytes/sec
I/O Subsystem
• Chipset: North Bridge, South Bridge
• Controllers and adapters, NICs
• I/O devices: disks, displays, keyboards, networks
Architecture continually changing
• Applications suggest how to improve technology and provide revenue to fund development
• Improved technologies make new applications possible
• Cost of software development makes compatibility a major force in the market
Review: What is Computer Architecture?

[Figure: the Computer Architect, surrounded by Applications, Technology, Interfaces, Machine Organization, and Measurement & Evaluation; other labels: API, ISA, Link, I/O Chan, IR, Regs]
The Architecture Process

[Figure: new concepts are created, their cost and performance are estimated, and they are sorted into good ideas, mediocre ideas, and bad ideas]
What is Performance?
• How can we make intelligent choices about computers?
• Why does some computer hardware perform better on some programs but worse on others?
• How do we measure the performance of a computer?
• What factors are hardware related? Software related?
• How does a machine's instruction set affect performance?
• Understanding performance is key to understanding underlying organizational motivation
Measuring performance
• We need measures
– Comparison of machine properties
– Comparison of software properties (compilers)
• Purpose
– Making purchase decisions
– Development of new architectures
• Is a single measure sufficient?
– Is a machine with a 600 MHz clock rate always faster than one with a 500 MHz clock rate?
– Why do we still have mainframes?
Performance Measurement and Evaluation
• Many dimensions to computer performance
– CPU execution time
• by instruction or sequence
– floating point
– integer
– branch performance
– Cache bandwidth
– Main memory bandwidth
– I/O performance
• bandwidth
• seeks
• pixels or polygons per second
• Relative importance depends on applications
Evaluation Tools
• Benchmarks, traces, & mixes
– macrobenchmarks & suites
• application execution time
– microbenchmarks
• measure one aspect of performance
– traces
• replay recorded accesses
– cache, branch, register
– Example instruction mix: MOVE 39%, BR 20%, LOAD 20%, STORE 10%, ALU 11%
• Simulation at many levels
– ISA, cycle accurate, RTL, gate, circuit
• trade fidelity for simulation rate
• Area and delay estimation
• Analysis
– e.g., queuing theory
– Fundamental Laws
Metrics of Computer Performance

System levels, from top to bottom: Application; Programming Language; Compiler; ISA; Datapath; Control; Function Units; Transistors, Wires, Pins

Metrics, from the application level down to the hardware level:
• Answers per month
• Operations per second
• (Millions) of instructions per second: MIPS
• (Millions) of (FP) operations per second: MFLOP/s
• Megabytes per second
• Cycles per second (clock rate)

Each metric has a purpose, and each can be misused.
Benchmarks and Benchmarking

Some definitions are:
• A test that measures the performance of a system or subsystem on a well-defined task or set of tasks.
• A method of comparing the performance of different computer architectures.
• A method of comparing the performance of different software.
Some Warnings about Benchmarks

• Benchmarks measure the whole system
– application
– compiler
– operating system
– architecture
– implementation
• Popular benchmarks typically reflect yesterday’s programs
– computers need to be designed for tomorrow’s programs
• Benchmark timings often very sensitive to
– alignment in cache
– location of data on disk
– values of data
• Benchmarks can lead to inbreeding or positive feedback
– if you make an operation fast (slow) it will be used more (less) often
• so you make it faster (slower)
– and it gets used even more (less)
» and so on…
Choosing Programs To Evaluate Performance
Levels of programs or benchmarks that could be used to evaluate
performance:
– Actual Target Workload: Full applications that run on the target
machine.
– Real Full Program-based Benchmarks:
• Select a specific mix or suite of programs that are typical of targeted applications or workload (e.g., SPEC95, SPEC CPU2000).
– Small “Kernel” Benchmarks:
• Key computationally-intensive pieces extracted from real programs.
– Examples: Matrix factorization, FFT, tree search, etc.
• Best used to test specific aspects of the machine.
– Microbenchmarks:
• Small, specially written programs that isolate a specific aspect of performance: integer processing, floating point, local memory, input/output, etc.
Types of Benchmarks
Actual Target Workload
• Pros: Representative.
• Cons: Very specific. Non-portable. Complex: difficult to run or measure.

Full Application Benchmarks
• Pros: Portable. Widely used. Measurements useful in reality.
• Cons: Less representative than actual workload.

Small “Kernel” Benchmarks
• Pros: Easy to run, early in the design cycle.
• Cons: Easy to “fool” by designing hardware to run them well.

Microbenchmarks
• Pros: Identify peak performance and potential bottlenecks.
• Cons: Peak performance results may be a long way from real application performance.
SPEC: System Performance Evaluation Cooperative
The most popular and industry-standard set of CPU benchmarks.
• SPECmarks, 1989:
– 10 programs yielding a single number (“SPECmarks”).
• SPEC92, 1992:
– SPECint92 (6 integer programs) and SPECfp92 (14 floating point programs).
• SPEC95, 1995:
– SPECint95 (8 integer programs):
• go, m88ksim, gcc, compress, li, ijpeg, perl, vortex
– SPECfp95 (10 floating-point intensive programs):
• tomcatv, swim, su2cor, hydro2d, mgrid, applu, turb3d, apsi, fpppp, wave5
– Performance relative to a Sun SuperSPARC I (50 MHz), which is given a score of SPECint95 = SPECfp95 = 1
• SPEC CPU2000, 1999:
– CINT2000 (12 integer programs). CFP2000 (14 floating-point intensive programs)
– Performance relative to a Sun Ultra5_10 (300 MHz) which is given a score
of SPECint2000 = SPECfp2000 = 100
Application Performance: Intel Core i9-12900K vs Ryzen 9 5950X and Ryzen 9 5900X

Power Consumption, Efficiency, and Cooling: Intel Core i9-12900K vs Ryzen 9 5950X and Ryzen 9 5900X
Architectural Performance Laws and Rules of Thumb
• Measurement and Evaluation
– Architecture is an iterative process:
• Searching the space of possible designs
• Make selections
• Evaluate the selections made
– Good measurement tools are required to accurately evaluate the
selection.
• Measurement Tools
– Benchmarks, Traces, Mixes
– Cost, delay, area, power estimation
– Simulation (many levels)
• ISA, RTL, Gate, Circuit
– Queuing Theory
– Rules of Thumb
– Fundamental Laws

Time as a Measure of Performance
• Response Time
– Time between start and completion of a task
– As observed and measured by the end user
– Also called Wall-Clock Time or Elapsed Time
– Response Time = CPU Time + Waiting Time (I/O, scheduling, etc.)
• CPU Execution Time
– Time spent executing the program instructions
– CPU time = User CPU time + Kernel CPU time
– Can be measured in seconds, msec, µsec, etc.
– Can be related to the number of CPU clock cycles
• Our focus: user CPU time
– Time spent executing the lines of code that are "in" our program
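The difference between response time and CPU time can be observed directly. The following is a minimal Python sketch, not part of the original slides; the helper names `measure`, `busy`, and `waiting` are illustrative assumptions. It uses the standard-library timers `time.perf_counter` (wall-clock time) and `time.process_time` (user + kernel CPU time of the current process).

```python
import time

def measure(task):
    """Run task() once and return (wall_seconds, cpu_seconds)."""
    wall_start = time.perf_counter()   # wall-clock (elapsed) time
    cpu_start = time.process_time()    # user + kernel CPU time of this process
    task()
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return wall, cpu

def busy():
    # CPU-bound work: CPU time should be close to wall-clock time
    total = 0
    for i in range(200_000):
        total += i * i
    return total

def waiting():
    # Waiting (here: sleeping) adds to response time but not to CPU time
    time.sleep(0.05)

if __name__ == "__main__":
    for name, task in (("busy", busy), ("waiting", waiting)):
        wall, cpu = measure(task)
        print(f"{name}: wall = {wall:.4f} s, cpu = {cpu:.4f} s")
```

For the `waiting` task the wall-clock time is dominated by waiting time, which is the gap the slide describes as Response Time = CPU Time + Waiting Time.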
Throughput as a Performance Metric
• Throughput = Total work done per unit of time
– Tasks per hour
– Transactions per minute

• Decreasing the execution time improves throughput
– Example: using a faster version of a processor
– Less time to run a task → more tasks can be executed per unit of time
• Parallel hardware improves throughput and response time
– By increasing the number of processors in a multiprocessor
– More tasks can be executed in parallel
– Execution time of individual sequential tasks is not changed
– Less waiting time in queues reduces (improves) response time


CPU Performance Evaluation
• Most computers run synchronously, utilizing a CPU clock running at a constant clock rate:
Clock rate = 1 / clock cycle
– Clock rate is measured in cycles/sec = Hertz = Hz
– Time is divided into successive clock cycles: cycle 1, cycle 2, cycle 3, ...
• The CPU clock rate depends on the specific CPU organization
(design) and hardware implementation technology (VLSI) used
• A computer machine (ISA) instruction is comprised of a number of
elementary or micro operations which vary in number and complexity
depending on the instruction and the exact CPU organization
(Design)
– A micro operation is an elementary hardware operation that can be performed
during one CPU clock cycle.
– This corresponds to one micro-instruction in microprogrammed CPUs.
– Examples: register operations: shift, load, clear, increment, ALU operations: add ,
subtract, etc.
• Thus a single machine instruction may take one or more CPU cycles to complete; this number is termed the Cycles Per Instruction (CPI).
– Average CPI of a program: The average CPI of all instructions executed in the program on a given CPU design.
Generic CPU Machine Instruction Execution Steps
Computer Performance Measures: Program Execution Time
• A specific program compiled to run on a specific machine (CPU) “A” has the following parameters:
– The total executed instruction count of the program: I
– The average number of cycles per instruction (average CPI): CPI
– The clock cycle of machine “A”: C
• How can one measure the performance of this machine (CPU) running this program?
– Intuitively, the machine (or CPU) is said to be faster, or to have better performance running this program, if the total execution time is shorter.
– Thus the inverse of the total measured program execution time is a possible performance measure or metric:
Performance_A = 1 / Execution Time_A
• How to compare performance of different machines? What factors affect performance? How to improve performance?
Comparing Computer Performance Using Execution Time
• To compare the performance of two machines (or CPUs) “A” and “B” running a given specific program:
Performance_A = 1 / Execution Time_A
Performance_B = 1 / Execution Time_B
• Machine A is n times faster than machine B (or slower, if n < 1) means:
Speedup = n = Performance_A / Performance_B = Execution Time_B / Execution Time_A
(i.e., speedup is a ratio of performance, with no units)
• Example: For a given program:
Execution time on machine A: Execution Time_A = 1 second
Execution time on machine B: Execution Time_B = 10 seconds
Speedup = Performance_A / Performance_B = Execution Time_B / Execution Time_A = 10 / 1 = 10
The performance of machine A is 10 times the performance of machine B when running this program, or: machine A is said to be 10 times faster than machine B when running this program.
CPU Execution Time: The CPU Equation
• A program is comprised of a number of executed instructions, I
– Measured in: instructions/program
• The average instruction executed takes a number of cycles per instruction (CPI) to be completed
– Measured in: cycles/instruction, CPI
– Or Instructions Per Cycle (IPC): IPC = 1/CPI
• CPU has a fixed clock cycle time C = 1/clock rate
– Measured in: seconds/cycle
• CPU execution time is the product of the above three parameters as follows:
CPU time = Seconds/Program = (Instructions/Program) x (Cycles/Instruction) x (Seconds/Cycle)
T = I x CPI x C
where T = execution time per program in seconds, I = number of instructions executed, CPI = average CPI for the program, C = CPU clock cycle
(This equation is commonly known as the CPU performance equation)
CPU Average CPI/Execution Time
For a given program executed on a given machine (CPU):
CPI (average) = Total program execution cycles / Instruction count
→ CPU clock cycles = Instruction count x CPI
CPU execution time = CPU clock cycles x Clock cycle
= Instruction count x CPI x Clock cycle
T = I x CPI x C
where T = execution time per program in seconds, I = number of instructions executed, CPI = average CPI for the program, C = CPU clock cycle
(This equation is commonly known as the CPU performance equation)


Improving the performance

CPU time = Seconds/Program = (Instructions/Program) x (Cycles/Instruction) x (Seconds/Cycle)
• Increase the clock frequency = reduce the clock period
• Reduce the number of cycles for the program
• Reduce the number of instructions
Instruction = cycle?
• Is the number of cycles identical with the number of
instructions?
– No!
• The number of cycles depends on the implementation of the
operations in hardware
– The number differs for each processor
– Why?
• Operations take different amounts of time
– Multiplication takes longer than addition
– Floating point operations take longer than integer operations
• The access time to a register is much shorter than to a memory location
Aspects of CPU Execution Time
CPU Time = Instruction count x CPI x Clock cycle
T = I x CPI x C
• Instruction Count I (executed) depends on: Program Used, Compiler, ISA
• CPI (average CPI) depends on: Program Used, Compiler, ISA, CPU Organization
• Clock Cycle C depends on: CPU Organization, Technology (VLSI)
Factors Affecting CPU Performance
CPU time = Seconds/Program = (Instructions/Program) x (Cycles/Instruction) x (Seconds/Cycle)

                                     Instruction Count I   CPI   Clock Cycle C
Program                                       X             X
Compiler                                      X             X
Instruction Set Architecture (ISA)            X             X
Organization (CPU Design)                                   X          X
Technology (VLSI)                                                      X
CPU Execution Time: Example
• A program is running on a specific machine (CPU) with the following parameters:
– Total executed instruction count: 10,000,000 instructions
– Average CPI for the program: 2.5 cycles/instruction
– CPU clock rate: 200 MHz (clock cycle = 5x10^-9 seconds)
• What is the execution time for this program?
CPU time = Seconds/Program = (Instructions/Program) x (Cycles/Instruction) x (Seconds/Cycle)
CPU time = Instruction count x CPI x Clock cycle (T = I x CPI x C)
= 10,000,000 x 2.5 x 1 / clock rate
= 10,000,000 x 2.5 x 5x10^-9
= 0.125 seconds
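The calculation above can be checked with a few lines of Python. This is a minimal sketch; the function name `cpu_time` and its parameter names are illustrative and do not come from the slides.

```python
def cpu_time(instruction_count, cpi, clock_rate_hz):
    """CPU performance equation: T = I x CPI x C, with C = 1 / clock rate."""
    clock_cycle = 1.0 / clock_rate_hz      # seconds per cycle
    return instruction_count * cpi * clock_cycle

# Values from the example: I = 10,000,000, CPI = 2.5, clock rate = 200 MHz
t = cpu_time(10_000_000, 2.5, 200e6)
print(f"CPU time = {t:.3f} seconds")       # CPU time = 0.125 seconds
```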
Performance Comparison: Example
• From the previous example: A program is running on a specific machine (CPU) with the following parameters:
– Total executed instruction count, I: 10,000,000 instructions
– Average CPI for the program: 2.5 cycles/instruction
– CPU clock rate: 200 MHz
• Using the same program with these changes:
– A new compiler used: new executed instruction count, I: 9,500,000; new CPI: 3.0
– Faster CPU implementation: new clock rate = 300 MHz
• What is the speedup with the changes?
Speedup = Old Execution Time / New Execution Time = (I_old x CPI_old x Clock cycle_old) / (I_new x CPI_new x Clock cycle_new)
Speedup = (10,000,000 x 2.5 x 5x10^-9) / (9,500,000 x 3 x 3.33x10^-9)
= 0.125 / 0.095 = 1.32
or 32% faster after the changes.
(Clock cycle = 1 / Clock rate; T = I x CPI x C)
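The same comparison, sketched in Python under the parameters given above. The helper `cpu_time` is a hypothetical name introduced only for this illustration.

```python
def cpu_time(instruction_count, cpi, clock_rate_hz):
    """T = I x CPI x C, written with C = 1 / clock rate."""
    return instruction_count * cpi / clock_rate_hz

old_time = cpu_time(10_000_000, 2.5, 200e6)   # original compiler and CPU
new_time = cpu_time(9_500_000, 3.0, 300e6)    # new compiler, faster clock
speedup = old_time / new_time

print(f"old = {old_time:.3f} s, new = {new_time:.3f} s, speedup = {speedup:.2f}")
# old = 0.125 s, new = 0.095 s, speedup = 1.32
```

Note that the new compiler raises the CPI from 2.5 to 3.0; the overall speedup comes from the lower instruction count and the faster clock together, which is why all three factors of the CPU equation must be considered at once.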


Instruction Types & CPI
• Given a program with n types or classes of instructions executed on a given CPU with the following characteristics:
C_i = Count of instructions of type i executed, i = 1, 2, …, n
CPI_i = Cycles per instruction for type i
Then:
CPI = CPU clock cycles / Instruction count I
Where:
$\text{CPU clock cycles} = \sum_{i=1}^{n} (CPI_i \times C_i)$
$\text{Executed instruction count } I = \sum_{i=1}^{n} C_i$
Instruction Types & CPI: An Example
• An instruction set has three instruction classes (CPIs are for a specific CPU design):

Instruction class   CPI
A                   1
B                   2
C                   3

• Two code sequences have the following instruction counts:

Instruction counts for instruction class
Code Sequence   A   B   C
1               2   1   2
2               4   1   1

• CPU cycles for sequence 1 = 2 x 1 + 1 x 2 + 2 x 3 = 10 cycles
CPI for sequence 1 = clock cycles / instruction count = 10 / 5 = 2
• CPU cycles for sequence 2 = 4 x 1 + 1 x 2 + 1 x 3 = 9 cycles
CPI for sequence 2 = 9 / 6 = 1.5

$\text{CPU clock cycles} = \sum_{i=1}^{n} (CPI_i \times C_i)$, CPI = CPU cycles / I
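The per-class sum used in this example can be sketched in Python as follows. The dictionary layout and the function names `total_cycles` and `average_cpi` are assumptions made for illustration.

```python
# CPI of each instruction class, as given in the example
CPI_BY_CLASS = {"A": 1, "B": 2, "C": 3}

def total_cycles(counts, cpi_by_class=CPI_BY_CLASS):
    """CPU clock cycles = sum over classes of CPI_i x C_i."""
    return sum(cpi_by_class[cls] * n for cls, n in counts.items())

def average_cpi(counts, cpi_by_class=CPI_BY_CLASS):
    """CPI = CPU clock cycles / executed instruction count."""
    return total_cycles(counts, cpi_by_class) / sum(counts.values())

sequence_1 = {"A": 2, "B": 1, "C": 2}
sequence_2 = {"A": 4, "B": 1, "C": 1}

for name, seq in (("sequence 1", sequence_1), ("sequence 2", sequence_2)):
    print(name, total_cycles(seq), "cycles, CPI =", average_cpi(seq))
# sequence 1 10 cycles, CPI = 2.0
# sequence 2 9 cycles, CPI = 1.5
```

Sequence 2 executes more instructions (6 versus 5) yet needs fewer cycles, because more of its instructions belong to the cheapest class.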
Instruction Frequency & CPI
• Given a program with n types or classes of instructions with the following characteristics:
C_i = Count of instructions of type i, i = 1, 2, …, n
CPI_i = Average cycles per instruction of type i
F_i = Frequency or fraction of instruction type i executed = C_i / total executed instruction count = C_i / I
Then:
$CPI = \sum_{i=1}^{n} (CPI_i \times F_i)$
Fraction of total execution time for instructions of type i = (CPI_i x F_i) / CPI
Instruction Type Frequency & CPI: A RISC Example
Program profile or executed instruction mix (a typical mix) for the base machine (Reg / Reg). The Freq and CPI_i columns are given:

Op       Freq, F_i   CPI_i   CPI_i x F_i   % Time = CPI_i x F_i / CPI
ALU      50%         1       0.5           23% = 0.5/2.2
Load     20%         5       1.0           45% = 1/2.2
Store    10%         3       0.3           14% = 0.3/2.2
Branch   20%         2       0.4           18% = 0.4/2.2
                             Sum = 2.2

$CPI = \sum_{i=1}^{n} (CPI_i \times F_i)$
CPI = 0.5 x 1 + 0.2 x 5 + 0.1 x 3 + 0.2 x 2 = 0.5 + 1 + 0.3 + 0.4 = 2.2
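The table above can be reproduced with a short Python sketch. The names `MIX`, `average_cpi`, and `time_fractions` are illustrative assumptions; the frequencies and CPIs are the ones given in the example.

```python
# Instruction mix from the example: op -> (frequency F_i, CPI_i)
MIX = {"ALU": (0.5, 1), "Load": (0.2, 5), "Store": (0.1, 3), "Branch": (0.2, 2)}

def average_cpi(mix):
    """CPI = sum over types of CPI_i x F_i."""
    return sum(freq * cpi for freq, cpi in mix.values())

def time_fractions(mix):
    """Fraction of execution time per type = CPI_i x F_i / CPI."""
    total = average_cpi(mix)
    return {op: freq * cpi / total for op, (freq, cpi) in mix.items()}

if __name__ == "__main__":
    print(f"CPI = {average_cpi(MIX):.1f}")
    for op, frac in time_fractions(MIX).items():
        print(f"{op}: {frac:.0%} of execution time")
```

Loads are only 20% of the executed instructions but account for about 45% of the execution time, which makes them the natural target for an enhancement.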
Computer Performance Measures :
MIPS (Million Instructions Per Second) Rating
• For a specific program running on a specific CPU, the MIPS rating is a measure of how many millions of instructions are executed per second:
MIPS Rating = Instruction count / (Execution Time x 10^6)
= Instruction count / (CPU clocks x Cycle time x 10^6)
= (Instruction count x Clock rate) / (Instruction count x CPI x 10^6)
= Clock rate / (CPI x 10^6)
• Major problem with MIPS rating: As shown above, the MIPS rating does not account for the count of instructions executed (I).
– A higher MIPS rating in many cases may not mean higher performance or better execution time, e.g., due to compiler design variations.
• In addition, the MIPS rating:
– Does not account for the instruction set architecture (ISA) used.
• Thus it cannot be used to compare computers/CPUs with different instruction sets.
– Is easy to abuse: the program used to get the MIPS rating is often omitted.
• Often the peak MIPS rating is provided for a given CPU. It is obtained using a program comprised entirely of instructions with the lowest CPI for the given CPU design, which does not represent real programs.
Computer Performance Measures :
MIPS (Million Instructions Per Second) Rating
• Under what conditions can the MIPS rating be used to compare performance of different CPUs?
• The MIPS rating is only valid to compare the performance of different CPUs provided that the following conditions are satisfied:
1 The same program is used (actually this applies to all performance metrics)
2 The same ISA is used
3 The same compiler is used
→ Thus the resulting programs used to run on the CPUs and obtain the MIPS rating are identical at the machine code level, including the same instruction count
Wrong!!!
• 3 significant problems with using MIPS:
– Problem 1:
• MIPS is instruction set dependent.
• (And different computer brands usually have different instruction sets)
– Problem 2:
• MIPS varies between programs on the same computer
– Problem 3:
• MIPS can vary inversely to performance!
• Let’s look at an example of why MIPS doesn’t work…
Compiler Variations, MIPS & Performance:
An Example
• For a machine (CPU) with instruction classes:

Instruction class   CPI
A                   1
B                   2
C                   3

• For a given high-level language program, two compilers produced the following executed instruction counts:

Instruction counts (in millions) for each instruction class
Code from:    A    B    C
Compiler 1    5    1    1
Compiler 2    10   1    1

• The machine is assumed to run at a clock rate of 100 MHz.
Compiler Variations, MIPS & Performance:
An Example (Continued)
MIPS = Clock rate / (CPI x 10^6) = (100 x 10^6) / (CPI x 10^6)
CPI = CPU execution cycles / Instruction count
$\text{CPU clock cycles} = \sum_{i=1}^{n} (CPI_i \times C_i)$
CPU time = Instruction count x CPI / Clock rate

• For compiler 1:
– CPI_1 = (5 x 1 + 1 x 2 + 1 x 3) / (5 + 1 + 1) = 10 / 7 = 1.43
– MIPS Rating_1 = (100 x 10^6) / (1.428 x 10^6) = 70.0 MIPS
– CPU time_1 = ((5 + 1 + 1) x 10^6 x 1.43) / (100 x 10^6) = 0.10 seconds

• For compiler 2:
– CPI_2 = (10 x 1 + 1 x 2 + 1 x 3) / (10 + 1 + 1) = 15 / 12 = 1.25
– MIPS Rating_2 = (100 x 10^6) / (1.25 x 10^6) = 80.0 MIPS
– CPU time_2 = ((10 + 1 + 1) x 10^6 x 1.25) / (100 x 10^6) = 0.15 seconds

The MIPS rating indicates that compiler 2 is better, while in reality the code produced by compiler 1 is faster.
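The compiler comparison can be reproduced with the sketch below. The function `evaluate` and the constant names are hypothetical; the class CPIs, instruction counts, and the 100 MHz clock rate are taken from the example.

```python
CLASS_CPI = {"A": 1, "B": 2, "C": 3}   # CPI per instruction class
CLOCK_RATE_HZ = 100e6                  # 100 MHz

def evaluate(counts_millions):
    """Return (average CPI, MIPS rating, CPU time in seconds)."""
    instructions = sum(counts_millions.values()) * 1e6
    cycles = sum(CLASS_CPI[c] * n for c, n in counts_millions.items()) * 1e6
    cpi = cycles / instructions
    mips = CLOCK_RATE_HZ / (cpi * 1e6)
    cpu_time = cycles / CLOCK_RATE_HZ
    return cpi, mips, cpu_time

compiler_1 = evaluate({"A": 5, "B": 1, "C": 1})
compiler_2 = evaluate({"A": 10, "B": 1, "C": 1})

for name, (cpi, mips, t) in (("compiler 1", compiler_1), ("compiler 2", compiler_2)):
    print(f"{name}: CPI = {cpi:.2f}, MIPS = {mips:.1f}, CPU time = {t:.2f} s")
# compiler 1: CPI = 1.43, MIPS = 70.0, CPU time = 0.10 s
# compiler 2: CPI = 1.25, MIPS = 80.0, CPU time = 0.15 s
```

Compiler 2 earns the higher MIPS rating only because it executes 5 million extra cheap class-A instructions, which lowers the average CPI while lengthening the run.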
MIPS (The ISA, not the metric) Loop Performance Example

For the loop:
    for (i=0; i<1000; i=i+1){
        x[i] = x[i] + s; }

MIPS assembly code is given by:
          lw   $3, 8($1)     ; load s in $3
          addi $6, $2, 4000  ; $6 = address of last element + 4
    loop: lw   $4, 0($2)     ; load x[i] in $4
          add  $5, $4, $3    ; $5 has x[i] + s
          sw   $5, 0($2)     ; store computed x[i]
          addi $2, $2, 4     ; increment $2 to point to next x[ ] element
          bne  $6, $2, loop  ; last loop iteration reached?

X[ ] is an array of words in memory, base address in $2; s is a constant word value in memory, address in $1.

[Figure: memory layout. $2 initially points to X[0] in low memory, the first element to compute; X[999] in high memory is the last element to compute; $6 points just past X[999].]

The MIPS code is executed on a specific CPU that runs at 500 MHz (clock cycle = 2 ns = 2x10^-9 seconds) with the following instruction type CPIs:

Instruction type   CPI
ALU                4
Load               5
Store              7
Branch             3

For this MIPS code running on this CPU find:
1- Fraction of total instructions executed for each instruction type
2- Total number of CPU cycles
3- Average CPI
4- Fraction of total execution time for each instruction type
5- Execution time
6- MIPS rating, peak MIPS rating for this CPU
MIPS (The ISA) Loop Performance Example (continued)
• The code has 2 instructions before the loop and 5 instructions in the body of the loop, which iterates 1000 times.
• Thus: Total instructions executed, I = 5x1000 + 2 = 5002 instructions
1 Number of instructions executed and fraction F_i for each instruction type:
– ALU instructions = 1 + 2x1000 = 2001, CPI_ALU = 4, F_ALU = 2001/5002 = 0.4 = 40%
– Load instructions = 1 + 1x1000 = 1001, CPI_Load = 5, F_Load = 1001/5002 = 0.2 = 20%
– Store instructions = 1000, CPI_Store = 7, F_Store = 1000/5002 = 0.2 = 20%
– Branch instructions = 1000, CPI_Branch = 3, F_Branch = 1000/5002 = 0.2 = 20%
2 $\text{CPU clock cycles} = \sum_{i=1}^{n} (CPI_i \times C_i)$
= 2001x4 + 1001x5 + 1000x7 + 1000x3 = 23009 cycles
3 Average CPI = CPU clock cycles / I = 23009/5002 = 4.6
4 Fraction of execution time for each instruction type:
– Fraction of time for ALU instructions = CPI_ALU x F_ALU / CPI = 4x0.4/4.6 = 0.348 = 34.8%
– Fraction of time for load instructions = CPI_Load x F_Load / CPI = 5x0.2/4.6 = 0.217 = 21.7%
– Fraction of time for store instructions = CPI_Store x F_Store / CPI = 7x0.2/4.6 = 0.304 = 30.4%
– Fraction of time for branch instructions = CPI_Branch x F_Branch / CPI = 3x0.2/4.6 = 0.13 = 13%
5 Execution time = I x CPI x C = CPU cycles x C = 23009 x 2x10^-9
= 4.6x10^-5 seconds = 0.046 msec = 46 usec
6 MIPS rating = Clock rate / (CPI x 10^6) = (500 x 10^6) / (4.6 x 10^6) = 108.7 MIPS
– The CPU achieves its peak MIPS rating when executing a program that only has instructions of the type with the lowest CPI. In this case branches with CPI_Branch = 3
– Peak MIPS rating = Clock rate / (CPI_Branch x 10^6) = (500 x 10^6) / (3 x 10^6) = 166.67 MIPS
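All six answers can be recomputed from the instruction counts with the following Python sketch. The variable names are illustrative assumptions; the counts, CPIs, and the 500 MHz clock rate come from the example.

```python
CLOCK_RATE_HZ = 500e6      # 500 MHz, clock cycle = 2 ns
ITERATIONS = 1000
CPI = {"ALU": 4, "Load": 5, "Store": 7, "Branch": 3}

# Executed instruction counts per type
COUNTS = {
    "ALU": 1 + 2 * ITERATIONS,    # addi before the loop; add and addi per iteration
    "Load": 1 + 1 * ITERATIONS,   # lw of s before the loop; lw of x[i] per iteration
    "Store": ITERATIONS,          # sw per iteration
    "Branch": ITERATIONS,         # bne per iteration
}

total_instructions = sum(COUNTS.values())
total_cycles = sum(CPI[t] * n for t, n in COUNTS.items())
average_cpi = total_cycles / total_instructions
instruction_fraction = {t: n / total_instructions for t, n in COUNTS.items()}
time_fraction = {t: CPI[t] * n / total_cycles for t, n in COUNTS.items()}
execution_time = total_cycles / CLOCK_RATE_HZ
mips = CLOCK_RATE_HZ / (average_cpi * 1e6)
peak_mips = CLOCK_RATE_HZ / (min(CPI.values()) * 1e6)

if __name__ == "__main__":
    print("I =", total_instructions, "cycles =", total_cycles)
    print(f"average CPI = {average_cpi:.1f}")
    for t in COUNTS:
        print(f"{t}: {instruction_fraction[t]:.1%} of instructions, "
              f"{time_fraction[t]:.1%} of time")
    print(f"execution time = {execution_time * 1e6:.0f} usec")
    print(f"MIPS = {mips:.1f}, peak MIPS = {peak_mips:.2f}")
```

The time fractions here are computed from exact counts, so they can differ in the last digit from the slide's values, which use the rounded fractions 0.4 and 0.2.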
Computer Performance Measures: MFLOPS
• A floating-point operation is an addition, subtraction, multiplication, or division operation applied to numbers represented by a single or a double precision floating-point representation.
• MFLOPS, for a specific program running on a specific computer, is a measure of millions of floating-point operations (megaflops) per second:
MFLOPS = Number of floating-point operations / (Execution time x 10^6)
• The MFLOPS rating is a better comparison measure between different machines than the MIPS rating.
– Applicable even if ISAs are different
Computer Performance Measures: MFLOPS
• Program-dependent: Different programs have different percentages of floating-point operations present, e.g., compilers have no floating-point operations and yield a MFLOPS rating of zero.
• Dependent on the type of floating-point operations present in the program.
– Peak MFLOPS rating for a CPU: Obtained using a program comprised entirely of the simplest floating point instructions (with the lowest CPI) for the given CPU design, which does not represent real floating point programs.
Quantitative Principles of Computer Design
• Amdahl’s Law:
– The performance gain from improving some portion of a computer is calculated by:
Speedup = Performance for the entire task using the enhancement / Performance for the entire task without using the enhancement
or
Speedup = Execution time for the entire task without the enhancement / Execution time for the entire task using the enhancement
Performance Enhancement Calculations:
Amdahl's Law
• The performance enhancement possible due to a given design improvement is limited by the amount that the improved feature is used
• Amdahl’s Law:
– Performance improvement or speedup due to enhancement E:
Speedup(E) = Execution Time without E / Execution Time with E = Performance with E / Performance without E
– Suppose that enhancement E accelerates a fraction F of the execution time by a factor S and the remainder of the time is unaffected, then:
Execution Time with E = ((1 - F) + F/S) x Execution Time without E
Hence speedup is given by:
Speedup(E) = Execution Time without E / (((1 - F) + F/S) x Execution Time without E) = 1 / ((1 - F) + F/S)
Note: F (fraction of execution time enhanced) refers to the original execution time before the enhancement is applied.
Pictorial Depiction of Amdahl’s Law
Enhancement E accelerates fraction F of original execution time by a factor of S.

Before: execution time without enhancement E (before the enhancement is applied)
• Shown normalized to 1 = (1 - F) + F
• Unaffected fraction: (1 - F); affected fraction: F

After: execution time with enhancement E
• Unaffected fraction: (1 - F), unchanged; affected fraction reduced to F/S

Speedup(E) = Execution Time without enhancement E / Execution Time with enhancement E = 1 / ((1 - F) + F/S)
Example of Amdahl’s Law
• Floating point instructions improved to run 2X, but only 10% of actual instructions are FP (taking this 10% as the fraction F of the original execution time)
ExTime_new = ExTime_old x (0.9 + 0.1/2) = 0.95 x ExTime_old
Speedup_overall = 1 / 0.95 = 1.053
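Amdahl's Law is a one-line function. The sketch below is illustrative: the name `amdahl_speedup` is an assumption, and the inputs F = 0.1 and S = 2 are the ones from the floating point example.

```python
def amdahl_speedup(fraction_enhanced, factor):
    """Speedup(E) = 1 / ((1 - F) + F / S)."""
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / factor)

# Floating point example: F = 0.1, S = 2
print(f"{amdahl_speedup(0.1, 2):.3f}")             # 1.053

# Upper bound: even an infinitely fast FP unit cannot exceed 1 / (1 - F)
print(f"{amdahl_speedup(0.1, float('inf')):.3f}")  # 1.111
```

The second call shows why the improvement is limited by how much the enhanced feature is used: with F = 0.1 the speedup can never exceed about 1.11.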
Performance Enhancement Example
• For the RISC machine with the following instruction mix given earlier (CPI = 2.2):

Op       Freq   Cycles   CPI(i)   % Time
ALU      50%    1        0.5      23%
Load     20%    5        1.0      45%
Store    10%    3        0.3      14%
Branch   20%    2        0.4      18%

• If a CPU design enhancement improves the CPI of load instructions from 5 to 2, what is the resulting performance improvement from this enhancement?
Fraction enhanced = F = 45% or 0.45
Unaffected fraction = 1 - F = 100% - 45% = 55% or 0.55
Factor of enhancement = S = 5/2 = 2.5
Using Amdahl’s Law:
Speedup(E) = 1 / ((1 - F) + F/S) = 1 / (0.55 + 0.45/2.5) = 1.37
An Alternative Solution Using CPU Equation
Op       Freq   Cycles   CPI(i)   % Time
ALU      50%    1        0.5      23%
Load     20%    5        1.0      45%
Store    10%    3        0.3      14%
Branch   20%    2        0.4      18%
(CPI = 2.2)

• If a CPU design enhancement improves the CPI of load instructions from 5 to 2, what is the resulting performance improvement from this enhancement?
Old CPI = 2.2
New CPI = 0.5 x 1 + 0.2 x 2 + 0.1 x 3 + 0.2 x 2 = 1.6
Speedup(E) = Original Execution Time / New Execution Time
= (Instruction count x old CPI x clock cycle) / (Instruction count x new CPI x clock cycle)
= old CPI / new CPI = 2.2 / 1.6 ≈ 1.37
Which is the same speedup obtained from Amdahl’s Law in the first solution.
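That both routes agree can be checked numerically. This is a minimal sketch with illustrative names (`MIX`, `average_cpi`, `amdahl_speedup`); it uses the unrounded load time fraction 1.0/2.2 rather than the rounded 45%, which is why the two results match exactly at 1.375.

```python
# Instruction mix: op -> (frequency F_i, CPI_i)
MIX = {"ALU": (0.5, 1), "Load": (0.2, 5), "Store": (0.1, 3), "Branch": (0.2, 2)}

def average_cpi(mix):
    return sum(freq * cpi for freq, cpi in mix.values())

def amdahl_speedup(fraction_enhanced, factor):
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / factor)

# Route 1: CPU equation, with load CPI improved from 5 to 2
old_cpi = average_cpi(MIX)
new_cpi = average_cpi(dict(MIX, Load=(0.2, 2)))
speedup_cpu_equation = old_cpi / new_cpi

# Route 2: Amdahl's Law, F = fraction of original time spent in loads, S = 5/2
load_time_fraction = 0.2 * 5 / old_cpi
speedup_amdahl = amdahl_speedup(load_time_fraction, 5 / 2)

print(f"CPU equation: {speedup_cpu_equation:.3f}, Amdahl: {speedup_amdahl:.3f}")
```

With F rounded to 0.45, Amdahl's Law gives 1 / 0.73, which rounds to the 1.37 quoted in the example.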
Performance Enhancement Example
• A program runs in 100 seconds on a machine with multiply operations responsible for 80 seconds of this time. By how much must the speed of multiplication be improved to make the program four times faster?
Desired speedup = 4 = 100 / Execution time with enhancement
→ Execution time with enhancement = 100/4 = 25 seconds
25 seconds = (100 - 80 seconds) + 80 seconds / S
25 seconds = 20 seconds + 80 seconds / S
→ 5 = 80 seconds / S
→ S = 80/5 = 16
Hence multiplication should be 16 times faster to get an overall speedup of 4.
• Alternatively, it can also be solved by finding the enhanced fraction of execution time, F = 80/100 = 0.8, and then solving Amdahl’s speedup equation for the desired enhancement factor S:
Speedup(E) = 1 / ((1 - F) + F/S) = 4 = 1 / ((1 - 0.8) + 0.8/S) = 1 / (0.2 + 0.8/S)
Solving for S gives S = 16.
Performance Enhancement Example
• For the previous example with a program running in 100 seconds on a machine with multiply operations responsible for 80 seconds of this time. By how much must the speed of multiplication be improved to make the program five times faster?
Desired speedup = 5 = 100 / Execution time with enhancement
→ Execution time with enhancement = 100/5 = 20 seconds
20 seconds = (100 - 80 seconds) + 80 seconds / S
20 seconds = 20 seconds + 80 seconds / S
→ 0 = 80 seconds / S
No amount of multiplication speed improvement can achieve this: the 20 seconds not spent on multiplication already equal the target execution time.
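Both multiplication examples solve the same equation for S. A hedged Python sketch follows; the function name `required_factor` and its convention of returning `None` for an unreachable target are assumptions made for illustration.

```python
def required_factor(total_time, enhanced_time, target_speedup):
    """Solve total/target = (total - enhanced) + enhanced / S for S.

    Returns None when the target is unreachable, i.e. when the unaffected
    time alone already uses up the whole target execution time.
    """
    target_time = total_time / target_speedup
    budget = target_time - (total_time - enhanced_time)  # time left for the enhanced part
    if budget <= 0:
        return None
    return enhanced_time / budget

print(required_factor(100, 80, 4))   # 16.0
print(required_factor(100, 80, 5))   # None
```

The limit follows from Amdahl's Law: with F = 0.8 the speedup approaches but never reaches 1 / (1 - F) = 5, however fast multiplication becomes.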
Extending Amdahl's Law To Multiple
Enhancements
• Suppose that enhancement E_i accelerates a fraction F_i of the original execution time by a factor S_i and the remainder of the time is unaffected, then:

$\text{Speedup} = \frac{\text{Original Execution Time}}{\left(\left(1 - \sum_i F_i\right) + \sum_i \frac{F_i}{S_i}\right) \times \text{Original Execution Time}}$

where $\left(1 - \sum_i F_i\right)$ is the unaffected fraction. Hence:

$\text{Speedup} = \frac{1}{\left(1 - \sum_i F_i\right) + \sum_i \frac{F_i}{S_i}}$

Note: All fractions F_i refer to the original execution time before the enhancements are applied.
Amdahl's Law With Multiple Enhancements:
Example
• Three CPU performance enhancements are proposed with the following speedups and percentages of the code execution time affected:
Speedup_1 = S_1 = 10    Percentage_1 = F_1 = 20%
Speedup_2 = S_2 = 15    Percentage_2 = F_2 = 15%
Speedup_3 = S_3 = 30    Percentage_3 = F_3 = 10%
• While all three enhancements are in place in the new design, each enhancement affects a different portion of the code and only one enhancement can be used at a time.
• What is the resulting overall speedup?
$\text{Speedup} = \frac{1}{\left(1 - \sum_i F_i\right) + \sum_i \frac{F_i}{S_i}}$
• Speedup = 1 / [(1 - 0.2 - 0.15 - 0.1) + 0.2/10 + 0.15/15 + 0.1/30]
= 1 / [0.55 + 0.0333]
= 1 / 0.5833 = 1.71
Pictorial Depiction of Example
Before:
Execution time with no enhancements: 1

Portion       Fraction of original time   Speedup     Time after
Unaffected    .55                         unchanged   .55
F1            .2                          S1 = 10     .2 / 10 = .02
F2            .15                         S2 = 15     .15 / 15 = .01
F3            .1                          S3 = 30     .1 / 30 = .00333

After:
Execution time with enhancements: .55 + .02 + .01 + .00333 = .5833

Speedup = 1 / .5833 = 1.71

Note: All fractions refer to original execution time.


Performance and Power
• Power is a key limitation
– Battery capacity has improved only slightly over time
• Need to design power-efficient processors
• Reduce power by
– Reducing frequency
– Reducing voltage
– Putting components to sleep
• Energy efficiency
– Important metric for power-limited applications
– Defined as performance divided by power consumption
Dynamic Energy and Power
• Dynamic energy
– Consumed when a transistor switches from 0 -> 1 or 1 -> 0
– ½ x Capacitive load x Voltage²

• Dynamic power
– ½ x Capacitive load x Voltage² x Frequency switched

• Reducing clock rate reduces power, not the energy needed for a given task
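A minimal Python sketch of the two formulas above, showing why a lower clock rate cuts power but not energy. The capacitance, voltage, frequency, and cycle count are hypothetical values chosen for illustration, and the sketch assumes a task needs a fixed number of clock cycles regardless of clock rate.

```python
def dynamic_energy(cap_load, voltage):
    """Energy of one 0 -> 1 or 1 -> 0 transition: 1/2 x C x V^2 (joules)."""
    return 0.5 * cap_load * voltage ** 2


def dynamic_power(cap_load, voltage, switch_freq):
    """Dynamic power: 1/2 x C x V^2 x frequency switched (watts)."""
    return dynamic_energy(cap_load, voltage) * switch_freq


def task_energy(cap_load, voltage, switch_freq, cycles):
    """Energy for a task of a fixed number of cycles (assumption): power x time."""
    run_time = cycles / switch_freq  # seconds
    return dynamic_power(cap_load, voltage, switch_freq) * run_time


# Hypothetical operating point, not taken from the slides.
C_LOAD = 1e-9   # farads
VOLTS = 1.2     # volts
FREQ = 2e9      # hertz
CYCLES = 4e9    # cycles the task needs

base_power = dynamic_power(C_LOAD, VOLTS, FREQ)
base_energy = task_energy(C_LOAD, VOLTS, FREQ, CYCLES)

# Halve the clock rate: power halves, but the task runs twice as long.
slow_power = dynamic_power(C_LOAD, VOLTS, FREQ / 2)
slow_energy = task_energy(C_LOAD, VOLTS, FREQ / 2, CYCLES)

# Lower the voltage by 10%: energy scales with voltage squared.
lowv_energy = task_energy(C_LOAD, 0.9 * VOLTS, FREQ, CYCLES)

print(f"power ratio at half clock:  {slow_power / base_power:.2f}")    # 0.50
print(f"energy ratio at half clock: {slow_energy / base_energy:.2f}")  # 1.00
print(f"energy ratio at 0.9x volts: {lowv_energy / base_energy:.2f}")  # 0.81
```

Under these assumptions only the voltage reduction lowers the energy per task, which is why reducing voltage appears alongside reducing frequency in the list of power-saving techniques.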


Power & Clock Rate
Performance and Power
[Figure: relative performance on SPECINT 2000 and SPECFP 2000 for Pentium M @ 1.6/0.6 GHz, Pentium 4-M @ 2.4/1.2 GHz, and Pentium III-M @ 1.2/0.8 GHz in three power modes: always on / maximum clock, laptop mode / adaptive clock, minimum power / minimum clock. x-axis: benchmark and power mode; y-axis: relative performance.]


Energy Efficiency
[Figure: relative energy efficiency on SPECINT 2000 and SPECFP 2000 for Pentium M @ 1.6/0.6 GHz, Pentium 4-M @ 2.4/1.2 GHz, and Pentium III-M @ 1.2/0.8 GHz in three power modes: always on / maximum clock, laptop mode / adaptive clock, minimum power / minimum clock. x-axis: benchmark and power mode; y-axis: relative energy efficiency.]

Energy efficiency of the Pentium M is highest for the SPEC2000 benchmarks


Chip Manufacturing Process
Silicon ingot → Slicer → Blank wafers (30 cm diameter, 1 mm thick)
→ Hundreds of steps → Patterned wafer
→ Dicer → Individual dies
→ Die tester → Tested dies
→ Bond die to package → Packaged dies
→ Part tester → Tested packaged dies
→ Ship to customers
Effect of Die Size on Yield
[Figure: two wafers with good and defective dies marked. Smaller dies: 120 dies, 109 good. Larger dies: 26 dies, 15 good.]

Dramatic decrease in yield with larger dies

Yield = (Number of Good Dies) / (Total Number of Dies)

$$\text{Yield} = \frac{1}{\left(1 + \text{Defects per area} \times \text{Die area} / 2\right)^2}$$

Die Cost = (Wafer Cost) / (Dies per Wafer × Yield)
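The yield and die-cost formulas can be tried out with a short Python sketch. The die counts (120 dies with 109 good, 26 dies with 15 good) come from the slide; the wafer cost, defect density, and die areas are hypothetical values, and the function names are assumptions made for illustration.

```python
def empirical_yield(good_dies, total_dies):
    """Yield = number of good dies / total number of dies."""
    return good_dies / total_dies


def model_yield(defects_per_area, die_area):
    """Yield = 1 / (1 + defects per area x die area / 2)^2."""
    return 1.0 / (1.0 + defects_per_area * die_area / 2.0) ** 2


def die_cost(wafer_cost, dies_per_wafer, yield_):
    """Die cost = wafer cost / (dies per wafer x yield)."""
    return wafer_cost / (dies_per_wafer * yield_)


# Die counts from the slide.
small_yield = empirical_yield(109, 120)
large_yield = empirical_yield(15, 26)
print(round(small_yield, 2))  # 0.91
print(round(large_yield, 2))  # 0.58

# Hypothetical wafer cost of 1000 (arbitrary currency units).
WAFER_COST = 1000.0
print(round(die_cost(WAFER_COST, 120, small_yield), 2))  # 9.17
print(round(die_cost(WAFER_COST, 26, large_yield), 2))   # 66.67

# Hypothetical defect density (0.5 per cm^2) and die areas (1 and 4 cm^2).
print(model_yield(0.5, 1.0))  # 0.64
print(model_yield(0.5, 4.0))  # 0.25
```

With the same wafer cost, the larger die is more expensive twice over: fewer dies fit on the wafer, and a smaller share of them work.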


Integrated Circuit Cost
• Integrated circuit

Yield = (Number of Good Dies) / (Total Number of Dies)


$$\text{Yield} = \frac{1}{\left(1 + \text{Defects per area} \times \text{Die area} / 2\right)^2}$$
Things to Remember
• Performance is specific to a particular program
– Any measure of performance should reflect execution time
– Total execution time is a consistent summary of
performance
• For a given ISA, performance improvements come
from
– Increases in clock rate (without increasing the CPI)
– Improvements in processor organization that lower CPI
– Compiler enhancements that lower CPI and/or instruction
count
– Algorithm/Language choices that affect instruction count
• Pitfalls (things you should avoid)
– Using a subset of the performance equation as a metric
– Expecting an improvement in one aspect of a computer to
increase overall performance in proportion to the size of
that improvement
Example
You are going to enhance a machine and there are
two types of possible improvements: either
• make multiply instructions run 4 times faster, or
• make memory access instructions run two times
faster than before.
You repeatedly run a program that takes 100
seconds to execute (on the original machine) and
find that of this time 25% is used for multiplication,
50% for memory access instructions, and 25% for
other tasks.

Example
1. What will the speedup be if you improve both
multiplication and memory access?
2. Assume the program you run has 10 billion
instructions and runs on a machine with a
clock rate of 1 GHz. Calculate the CPI for this
machine. Assume further that the CPI for
multiplication instructions is 20 cycles and the CPI
for memory access instructions is 6 cycles. Compute
the CPI for all other instructions.
3. What is the CPI for the improved machine when
improvements on both multiplication and memory
access instructions are made?
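One possible way to work through the three questions is sketched below in Python. It rests on two assumptions that the exercise does not state: the improvements leave the clock rate and the instruction count unchanged, and each category's share of execution time equals its share of clock cycles. The variable names are illustrative.

```python
ORIGINAL_TIME = 100.0   # seconds on the original machine
INSTRUCTIONS = 10e9     # 10 billion instructions
CLOCK_RATE = 1e9        # 1 GHz

# Share of original execution time and proposed speedup per category.
share = {"multiply": 0.25, "memory": 0.50, "other": 0.25}
factor = {"multiply": 4.0, "memory": 2.0, "other": 1.0}
given_cpi = {"multiply": 20.0, "memory": 6.0}

# Question 1: overall speedup with both improvements (Amdahl's Law).
new_time = sum(ORIGINAL_TIME * share[k] / factor[k] for k in share)
overall_speedup = ORIGINAL_TIME / new_time

# Question 2: overall CPI, then the CPI of the "other" instructions.
total_cycles = ORIGINAL_TIME * CLOCK_RATE
overall_cpi = total_cycles / INSTRUCTIONS
cycles = {k: total_cycles * share[k] for k in share}
counts = {k: cycles[k] / given_cpi[k] for k in given_cpi}
counts["other"] = INSTRUCTIONS - sum(counts.values())
other_cpi = cycles["other"] / counts["other"]

# Question 3: CPI of the improved machine (same clock rate, same instructions).
improved_cpi = new_time * CLOCK_RATE / INSTRUCTIONS

print(round(overall_speedup, 2))  # 1.78
print(round(overall_cpi, 2))      # 10.0
print(round(other_cpi, 2))        # 60.0
print(round(improved_cpi, 3))     # 5.625
```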
