Lecture 02 CH01 Performance Power
Evaluation is the first step
◼ To build a good computer, the first thing you have to do is
evaluate how good it is.
[Figure: compiler, interface, evaluating performance]
Metrics Matter
◼ Performance
❑ How fast a program can execute
◼ Power
❑ How much energy is consumed
◼ Other metrics
❑ Yield
❑ Cost
❑ Etc.
Outline of this lecture
◼ Performance
❑ Basics of performance evaluations
❑ Basic idea of benchmarks
❑ Making the common case fast!
❑ Reporting performance results
◼ Power
❑ Basics of power/energy evaluations
❑ Reducing energy by going from uniprocessor to multiprocessors
Performance
◼ Why do we care about performance evaluation?
❑ Purchasing perspective
◼ given a collection of machines, which has the
❑ best performance ?
❑ least cost ?
❑ best performance / cost ?
❑ best performance / energy ?
❑ Design perspective
◼ faced with design options, which has the
❑ best performance improvement ?
❑ least cost ?
❑ best performance / cost ?
Definitions
◼ Performance is in units of things-per-second
❑ bigger is better
◼ If we are primarily concerned with response time
❑ $\text{performance}(x) = \dfrac{1}{\text{execution\_time}(x)}$
Example of Relative Performance
(1) $\dfrac{\text{performance}(A)}{\text{performance}(B)} = \dfrac{\text{execution\_time}(B)}{\text{execution\_time}(A)} = n$
(2) Performance ratio: 15/10 = 1.5
(3) A is 1.5 times faster than B
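The ratio above can be checked with a few lines of Python. This is only a minimal sketch: the function name `relative_performance` is illustrative, and the execution times of 10 and 15 are an assumption read off the 15/10 ratio in step (2).

```python
def relative_performance(time_a: float, time_b: float) -> float:
    """Return performance(A) / performance(B), with performance = 1 / execution time."""
    perf_a = 1.0 / time_a
    perf_b = 1.0 / time_b
    return perf_a / perf_b


# Assumed inputs: A takes 10 time units, B takes 15 time units.
n = relative_performance(10.0, 15.0)
print(f"A is {n:.1f} times faster than B")
```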
Metrics for Performance Evaluation
[Figure: clock cycles over time; each cycle covers data transfer and computation, followed by a state update]
How to Improve Performance
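\[ \frac{\text{seconds}}{\text{program}} = \frac{\text{cycles}}{\text{program}} \times \frac{\text{seconds}}{\text{cycle}} \]
So, to improve performance, either reduce the number of cycles a program requires or reduce the clock cycle time (that is, increase the clock rate).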
Example of Improving Performance
◼ A program runs in 10 seconds on computer A, which has a 4 GHz clock. We are
trying to help a computer designer build a computer, B, that will run
this program in 6 seconds. The designer has determined that a
substantial increase in the clock rate is possible, but this increase
will affect the rest of the CPU design, causing computer B to
require 1.2 times as many clock cycles as computer A for this
program. What clock rate should we tell the designer to target?
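One way to work this out is to count the cycles on A first and then ask how fast B must clock to finish 1.2 times as many cycles in 6 seconds. The sketch below assumes only the numbers stated in the problem; the function and parameter names are illustrative.

```python
def target_clock_rate(time_a: float, clock_rate_a: float,
                      cycle_ratio: float, time_b: float) -> float:
    """Clock rate (Hz) that B needs, from seconds = cycles / clock rate."""
    cycles_a = time_a * clock_rate_a      # cycles the program takes on A
    cycles_b = cycle_ratio * cycles_a     # B needs cycle_ratio times as many
    return cycles_b / time_b              # cycles per second required on B


# Values from the problem statement: 10 s at 4 GHz, 1.2x cycles, 6 s target.
rate_b = target_clock_rate(10.0, 4e9, 1.2, 6.0)
print(f"Target clock rate for B: {rate_b / 1e9:.1f} GHz")
```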
[Figure: instructions (1st, 2nd, 3rd, 4th, 5th, 6th, ...) laid out along a time axis]
Different numbers of cycles for different instructions
[Figure: instructions along a time axis]
Performance Equation
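\[ \text{CPU time} = \frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}} \]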
❑ # of instructions in program?
◼ MIPS
❑ Million Instructions per Second
❑ A metric commonly used by vendors to advertise the
performance of their CPUs
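The performance equation translates directly into code. The sketch below is illustrative only: it writes seconds per cycle as 1 / clock rate, and the sample values (one billion instructions, CPI of 2.0, 1 GHz) are hypothetical, not taken from the lecture.

```python
def cpu_time(instruction_count: float, cpi: float, clock_rate_hz: float) -> float:
    """CPU time in seconds = instructions x cycles/instruction x seconds/cycle."""
    seconds_per_cycle = 1.0 / clock_rate_hz
    return instruction_count * cpi * seconds_per_cycle


# Hypothetical program: 1e9 instructions, CPI 2.0, 1 GHz clock.
example_time = cpu_time(1e9, 2.0, 1e9)
print(f"CPU time: {example_time:.2f} s")
```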
Example #2 MIPS Performance Measure
(3) $\text{MIPS}_1 = \dfrac{(5+1+1) \times 10^{9}}{2.5 \times 10^{6}} = 2800$
$\text{MIPS}_2 = \dfrac{(10+1+1) \times 10^{9}}{3.75 \times 10^{6}} = 3200$
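The two MIPS figures can be reproduced as follows. This is a sketch under one assumption: the denominators 2.5 and 3.75 are read as execution times in seconds, so that MIPS = instruction count / (execution time x 10^6). The function name `mips` is illustrative.

```python
def mips(instruction_count: float, execution_time_s: float) -> float:
    """Million instructions per second."""
    return instruction_count / (execution_time_s * 1e6)


# Instruction counts and (assumed) execution times from step (3).
mips1 = mips((5 + 1 + 1) * 1e9, 2.5)
mips2 = mips((10 + 1 + 1) * 1e9, 3.75)
print(f"MIPS1 = {mips1:.0f}, MIPS2 = {mips2:.0f}")
```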
Example #4
◼ Suppose we have two implementations of the same instruction set
architecture (ISA).
For some program,
Machine A has a clock cycle time of 10 ns and a CPI of 2.0
Machine B has a clock cycle time of 20 ns and a CPI of 1.2
Which machine is faster for this program, and by how much?
\[ \frac{\text{Execution\_Time}(A)}{\text{Execution\_Time}(B)} = \frac{I \times 2.0 \times 10}{I \times 1.2 \times 20} \]
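Because both machines run the same program on the same ISA, the instruction count I cancels and only CPI x cycle time matters. A minimal sketch of the comparison, with illustrative names:

```python
def time_per_instruction_ns(cpi: float, cycle_time_ns: float) -> float:
    """Average time per instruction in nanoseconds."""
    return cpi * cycle_time_ns


time_a = time_per_instruction_ns(2.0, 10.0)   # machine A
time_b = time_per_instruction_ns(1.2, 20.0)   # machine B

# Ratio of B's time to A's time: how many times faster A is.
a_over_b_speed = time_b / time_a
print(f"A: {time_a:.0f} ns/instr, B: {time_b:.0f} ns/instr, "
      f"A is {a_over_b_speed:.1f} times faster")
```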
Example #5
Instruction class CPI for this instruction class
A 1
B 2
C 3
(1) Seq1 = 2 + 1 + 2 = 5
Seq2 = 4 + 1 + 1 = 6
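The instruction counts in step (1) can be combined with the CPI table to get total cycles and average CPI for each sequence. The sketch below assumes the three addends in each sum are the counts for classes A, B, and C in that order; the original problem statement is not shown here, so treat that mapping as an assumption.

```python
CPI_BY_CLASS = {"A": 1, "B": 2, "C": 3}


def summarize(counts: dict) -> tuple:
    """Return (instruction count, total cycles, average CPI) for one sequence."""
    instructions = sum(counts.values())
    cycles = sum(CPI_BY_CLASS[cls] * n for cls, n in counts.items())
    return instructions, cycles, cycles / instructions


# Assumed per-class counts, read from "2 + 1 + 2" and "4 + 1 + 1".
seq1 = {"A": 2, "B": 1, "C": 2}
seq2 = {"A": 4, "B": 1, "C": 1}

for name, seq in (("Seq1", seq1), ("Seq2", seq2)):
    n, cycles, cpi = summarize(seq)
    print(f"{name}: {n} instructions, {cycles} cycles, CPI = {cpi:.1f}")
```

Under that assumption, the sequence with more instructions takes fewer cycles, which is why instruction count alone is not a performance measure.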
Aspects of CPU Performance
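\[ \text{CPU time} = \frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}} \]
                                     Instructions/Program   Cycles/Instruction   Seconds/Cycle
Algorithm                                     X                     X
Programming language                          X                     X
Compiler                                      X                     X
ISA (instruction set architecture)            X                     X                  X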
Performance Improvement
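\[ \text{Time} = \frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Clock cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Clock cycle}} \]
[Figure: SPECint chart including the Pentium Pro]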
How to get IC and CPI values?
◼ Small benchmarks
❑ nice for architects and designers
❑ easy to standardize
❑ can be abused
◼ SPEC (System Performance Evaluation Cooperative)
❑ companies have agreed on a set of real programs and inputs
❑ can still be abused
❑ valuable indicator of performance (and compiler technology)
❑ latest CPU suite: SPEC CPU2017
SPEC Benchmarks
◼ CPU
❑ Computation-intensive workload for testing different
CPU architectures
❑ Two major sets:
◼ Integer (SPEC CINT)
◼ Floating point (SPEC CFP)
◼ High Performance Computing, OpenMP, MPI
❑ For testing parallel applications
◼ Power
◼ Web server
◼ More information at http://www.spec.org/
SPEC CINT2000
Now you have several results…
◼ Usually you will have results of
❑ Running the workload on the new machine
❑ Running the workload on the old/reference machine
❑ Execution times of all programs in the workload
◼ How do you report this clearly?
Programs of     Exe. Time of    Exe. Time of    Speedup
the workload    Reference       New
A               1000            500             2
B               90              20              4.5
C               600             150             4
D               10              1               10
E               12600           300             42
F               1200            60              20
How to report results clearly?
Comparisons of Reporting Methods
Programs of     Exe. Time of    Exe. Time of    Speedup
the workload    Reference       New
A               1000            500             2
B               90              20              4.5
C               600             150             4
D               10              1               10
E               12600           300             42
F               1200            60              20
Avg.            2583            171             13.75
[Figure: "Comparison of Execution Times" bar chart, in seconds, showing Exe. Time of Reference and Exe. Time of New for A, B, C, D, E, F, and the average; annotated "Huge Differences!"]
[Figure: second bar chart over A, B, C, D, E, F, and the average]
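The "Avg." row can be reproduced from the per-program times. The sketch below uses only the numbers in the table; the variable names are illustrative, and the averages are plain arithmetic means, which is what the table's values correspond to.

```python
# (reference time, new time) per program, copied from the table.
times = {
    "A": (1000, 500),
    "B": (90, 20),
    "C": (600, 150),
    "D": (10, 1),
    "E": (12600, 300),
    "F": (1200, 60),
}

# Speedup of the new machine over the reference, per program.
speedups = {name: ref / new for name, (ref, new) in times.items()}

avg_ref = sum(ref for ref, _ in times.values()) / len(times)
avg_new = sum(new for _, new in times.values()) / len(times)
avg_speedup = sum(speedups.values()) / len(speedups)

print(f"Avg. reference time: {avg_ref:.1f}")
print(f"Avg. new time:       {avg_new:.1f}")
print(f"Avg. speedup:        {avg_speedup:.2f}")
```

Note that program E dominates the average reference time, which is one reason raw execution times are hard to compare across a workload.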
Normalize to the reference machine
◼ Normalization
❑ adjusting values measured on different scales to a
notionally common scale
❑ Normalized to the reference machine
→ exe.new / exe.reference
Programs of     Exe. Time of    Exe. Time of    Speedup    Normalized to
the workload    Reference       New                        Reference
[Figure: execution times normalized to reference, axis in percent]
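Normalization as defined above (exe.new / exe.reference) can be sketched as follows. The execution times are reused from the workload table earlier in the lecture; printing the result as a percentage is an assumption based on the percent axis of the chart.

```python
reference = {"A": 1000, "B": 90, "C": 600, "D": 10, "E": 12600, "F": 1200}
new = {"A": 500, "B": 20, "C": 150, "D": 1, "E": 300, "F": 60}

# Execution time on the new machine as a fraction of the reference time.
normalized = {name: new[name] / reference[name] for name in reference}

for name, value in normalized.items():
    print(f"{name}: {value:.1%} of reference")
```

After normalization every program is on the same 0 to 100% scale, so a long-running program no longer hides the others.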
Amdahl's Law
Speedup due to enhancement E:
\[ \text{Speedup}(E) = \frac{\text{ExTime w/o } E}{\text{ExTime w/ } E} = \frac{\text{Performance w/ } E}{\text{Performance w/o } E} \]
Amdahl’s Law
◼ Floating-point instructions are improved to run 2X faster,
but only 10% of the instructions actually executed are FP
\[ \text{Speedup}_{\text{overall}} = \frac{1}{(1 - 0.1) + \frac{0.1}{2}} = \frac{1}{0.95} = 1.053 \]
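The same calculation works for any enhanced fraction and speedup factor. A minimal sketch, with illustrative names, of Amdahl's Law in its usual form (unenhanced part plus enhanced part divided by its speedup):

```python
def amdahl_speedup(fraction_enhanced: float, speedup_enhanced: float) -> float:
    """Overall speedup when only part of the execution is sped up."""
    unaffected = 1.0 - fraction_enhanced
    improved = fraction_enhanced / speedup_enhanced
    return 1.0 / (unaffected + improved)


# The example above: 10% of instructions are FP, and FP runs 2x faster.
overall = amdahl_speedup(0.10, 2.0)
print(f"Overall speedup: {overall:.3f}")
```

Even an unbounded FP speedup would give at most 1 / 0.9, which is the point of "make the common case fast".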
Eight Great Ideas
Example #6
◼ Our favorite program runs in 10 seconds on computer A, which has a
400 MHz clock. We are trying to help a computer designer build a new
machine B that will run this program in 6 seconds. The designer can use
new (or perhaps more expensive) technology to substantially increase the
clock rate, but has informed us that this increase will affect the rest of the
CPU design, causing machine B to require 1.2 times as many clock cycles
as machine A for the same program. What clock rate should we tell the
designer to target?
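\[ \frac{\text{Execution\_Time}(A)}{\text{Execution\_Time}(B)} = \frac{10}{6} = \frac{C \times \frac{1}{400 \times 10^{6}}}{1.2\,C \times \frac{1}{x}} \]
where C is the number of clock cycles on A and x is the clock rate of B, using
\[ \frac{\text{seconds}}{\text{program}} = \frac{\text{cycles}}{\text{program}} \times \frac{\text{seconds}}{\text{cycle}} \]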
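Solving the ratio for the unknown clock rate takes two steps. The derivation below is a worked sketch that uses only the numbers in the problem statement; C denotes the cycle count on A and x the clock rate of B, and C cancels.

```latex
\begin{align*}
\frac{10}{6} &= \frac{C \times \frac{1}{400 \times 10^{6}}}{1.2\,C \times \frac{1}{x}}
              = \frac{x}{1.2 \times 400 \times 10^{6}} \\
x &= \frac{10}{6} \times 1.2 \times 400 \times 10^{6}
   = 800 \times 10^{6}\ \text{Hz} = 800\ \text{MHz}
\end{align*}
```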
With Moore’s Law…The Power Wall
Moore’s Law (1965): “The density of transistors in an
integrated circuit will double every year.” (18 months in fact)
Processor performance increases
[Figure: processor performance over time, with growth rates of 1.35x per year and 1.58x per year; labeled processors include MIPS R2000, IBM Power1, HP 9000, DEC Alpha, and Intel Pentium]
Power dissipation also increases…
[Figure: power chart with reference labels "Nuclear Reactor" and "Nozzle"]
Basics of Power Consumption
E = t * P
[Figure: leakage power and dynamic power at process nodes 250 nm, 180 nm, 130 nm, 100 nm, and 50 nm]
Switch from Uniprocessor to Multiprocessor
Case 2: Processor 1 and Processor 2, each running for time t
\[ \text{Energy} = 2t \times \left(\alpha C (0.5V)^{2} (0.5f)\right) = 2t \times 0.125\,\alpha C V^{2} f = 0.25\,t \times \alpha C V^{2} f \]
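The Case 2 arithmetic can be checked numerically. The sketch below takes dynamic power as αCV²f, as in the formula above, and E = t * P. The single-processor baseline at full voltage and frequency is an assumption added for comparison, and all parameter values are hypothetical unit values.

```python
def dynamic_energy(run_time: float, alpha: float, capacitance: float,
                   voltage: float, frequency: float) -> float:
    """Energy = time x dynamic power, with dynamic power = alpha * C * V^2 * f."""
    power = alpha * capacitance * voltage ** 2 * frequency
    return run_time * power


# Hypothetical unit values; only the ratio between the two cases matters.
t, alpha, C, V, f = 1.0, 1.0, 1.0, 1.0, 1.0

# Assumed baseline: one processor at full voltage and frequency for time t.
single = dynamic_energy(t, alpha, C, V, f)

# Case 2: two processors, each at half voltage and half frequency for time t.
dual = 2 * dynamic_energy(t, alpha, C, 0.5 * V, 0.5 * f)

ratio = dual / single
print(f"Case 2 energy relative to the baseline: {ratio:.2f}")
```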
§1.8 The Sea Change: The Switch to Multiprocessors
Uniprocessor Performance
◼ Multicore microprocessors
❑ More than one processor per chip
◼ Requires explicitly parallel programming
❑ Compare with instruction level parallelism
◼ Hardware executes multiple instructions at once
◼ Hidden from the programmer
❑ Hard to do
◼ Programming for performance
◼ Load balancing
◼ Optimizing communication and synchronization
Chapter 1 — Computer
Abstractions and
Technology — 49
Multiprocessor Trend
Design Challenges of Multicore Architecture
◼ Parallel programming
❑ Rewrite the originally sequential program to take
advantage of multiple processors
❑ OpenMP, POSIX threads, CUDA
◼ Load balance of processors
❑ How to schedule tasks onto processors?
◼ Communication & synchronization issues
Evaluations with SPECspeed 2017 Integer benchmarks on a
1.8 GHz Intel Xeon E5-2650L
SPEC Power Benchmark
◼ Power consumption of server at different
workload levels
❑ Performance: ssj_ops/sec
❑ Power: Watts (Joules/sec)
\[ \text{Overall ssj\_ops per Watt} = \frac{\sum_{i=0}^{10} \text{ssj\_ops}_i}{\sum_{i=0}^{10} \text{power}_i} \]
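The overall metric is a ratio of sums over the 11 load levels (i = 0 to 10), not an average of per-level ratios. A minimal sketch follows; the function name and the sample measurements are hypothetical and are not results from any real SPECpower run.

```python
def overall_ssj_ops_per_watt(ssj_ops: list, power_watts: list) -> float:
    """Sum of ssj_ops over all load levels divided by the sum of power."""
    if len(ssj_ops) != len(power_watts):
        raise ValueError("need one power reading per load level")
    return sum(ssj_ops) / sum(power_watts)


# Hypothetical measurements for 11 load levels, from full load down to idle.
ops = [1000 * level for level in range(10, -1, -1)]      # 10000 ... 0
power = [50 + 10 * level for level in range(10, -1, -1)]  # 150 W ... 50 W

score = overall_ssj_ops_per_watt(ops, power)
print(f"Overall ssj_ops per Watt: {score:.1f}")
```

Because idle power stays in the denominator while idle throughput is zero, a server that draws a lot of power at idle scores worse.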
SPECpower_ssj2008 for Xeon E5-2650L
Fallacy: Low Power at Idle
Concluding Remarks
◼ Cost/performance is improving
❑ Due to underlying technology development
◼ Hierarchical layers of abstraction
❑ In both hardware and software
◼ Instruction set architecture
❑ The hardware/software interface
◼ Execution time: the best performance measure
◼ Power is a limiting factor
❑ Use parallelism to improve performance
Reading Assignment
◼ 2.1 ~ 2.7
Which of these airplanes has the best performance?
Airplane    Passenger capacity    Cruising range (miles)    Cruising speed (m.p.h.)    Passenger throughput (passengers x m.p.h.)
Which one is faster? Concorde or Boeing 747
FOR YOUR READING
Benchmarking the Intel Core i7
SPECINT 2006 running on Intel Core i7
SPECpower_ssj2008 running on Intel Xeon X5650
Fallacies and Pitfalls
Further Reading
◼ Section 1.13
◼ Section 1.14
❑ Self study
Evolution of Intel Microprocessors : 4004
❑ 6 μm process
❑ 4500 transistors
❑ 2 MHz
❑ 8-bit word size
◼ Pipelining (1989)
❑ Floating point unit
❑ 8 KB cache
◼ Characteristics
❑ 1-0.6 μm process
❑ 1.2M transistors
❑ 25-100 MHz
❑ 32-bit word size
◼ Superscalar (1993)
❑ 2 instructions per cycle
◼ Characteristics
❑ 0.8-0.35 μm process
❑ 3.2M transistors
❑ 60-300 MHz
❑ 1.4-3.4 GHz
❑ Extended Memory 64 Technology
❑ HyperThreading
❑ 2.66 GHz
❑ Extended Memory 64 Technology
❑ HyperThreading
Courtesy of Intel Museum