06 CA (Performance Enhancement)
Najeeb-Ur-Rehman
Assistant Professor
Department of Computer Science
Faculty of Computing & IT
University of Gujrat
# of Instructions Example
A compiler designer is trying to decide between
two code sequences for a particular machine.
Based on the hardware implementation, there
are three different classes of instructions: Class
A, Class B, and Class C, and they require one,
two, and three cycles (respectively).
The first code sequence has 5 instructions: 2 of
A, 1 of B, and 2 of C. The second sequence has 6
instructions: 4 of A, 1 of B, and 1 of C.
Which sequence will be faster? How much?
What is the CPI for each sequence?
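One way to work the example is to count cycles directly. The sketch below assumes both sequences run at the same clock rate, so the ratio of cycle counts is the ratio of execution times; the names `CYCLES_PER_CLASS`, `total_cycles`, and `cpi` are illustrative only.

```python
# Cycles required by each instruction class (from the example).
CYCLES_PER_CLASS = {"A": 1, "B": 2, "C": 3}

def total_cycles(mix):
    """Total clock cycles for an instruction mix {class: count}."""
    return sum(CYCLES_PER_CLASS[cls] * count for cls, count in mix.items())

def cpi(mix):
    """Average cycles per instruction for the mix."""
    return total_cycles(mix) / sum(mix.values())

seq1 = {"A": 2, "B": 1, "C": 2}  # 5 instructions
seq2 = {"A": 4, "B": 1, "C": 1}  # 6 instructions

for name, mix in (("sequence 1", seq1), ("sequence 2", seq2)):
    print(f"{name}: {total_cycles(mix)} cycles, CPI = {cpi(mix):.2f}")

# Same clock rate assumed, so the time ratio equals the cycle ratio.
print(f"sequence 2 is {total_cycles(seq1) / total_cycles(seq2):.2f}x faster")
```

Sequence 1 needs 10 cycles (CPI 2.0) and sequence 2 needs 9 cycles (CPI 1.5), so sequence 2 is about 1.11 times faster even though it has more instructions.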
Performance
Performance is determined by execution time
Do any of the other variables equal performance?
# of cycles to execute program?
# of instructions in program?
# of cycles per second?
average # of cycles per instruction?
average # of instructions per second?
Quantitative Principles
Make common case fast
Favor the frequent case (simpler) over the infrequent
case.
For example, given that overflow in addition is
infrequent, favor optimizing the case when no
overflow occurs.
Objective
Determine the frequent case.
Determine how much improvement in performance is
possible by making it faster.
Amdahl’s law
The parameter to use in measuring the
effect of Amdahl's Law is speedup:
$$
\text{Speedup} = \frac{\text{Performance using enhancement}}{\text{Performance without using enhancement}}
$$
or
$$
\text{Speedup} = \frac{\text{Execution time without enhancement}}{\text{Execution time with enhancement}}
$$
Speedup depends on two factors
The fraction of the computation time in the original
machine that can be converted to take advantage of
the enhancement
$\text{Fraction}_{\text{enhanced}} \le 1$
The improvement gained by the enhanced execution
mode
$\text{Speedup}_{\text{enhanced}} \ge 1$
Example
Trip from point A to point B in two parts
A to C: 20 (fixed); C to B: 50, 20, 4, 1.7, or 0.3

A-C Trip   C-B Trip   Total Time   C-B Speedup   Overall Speedup
20         50         70           1             1
20         20         40           2.5           1.75
20         4          24           12.5          2.9
20         1.7        21.7         29.4          3.2
20         0.3        20.3         166.67        3.4
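The rows of the trip table can be reproduced with a few lines of arithmetic. This is a minimal sketch; the names `AC_TIME`, `CB_BASELINE`, and `trip_speedups` are made up for illustration, and the times carry whatever unit the table uses.

```python
AC_TIME = 20.0      # A-C leg, which cannot be sped up
CB_BASELINE = 50.0  # C-B leg in the baseline (first) row

def trip_speedups(cb_time):
    """Return (total time, C-B speedup, overall speedup) for a given C-B time."""
    total = AC_TIME + cb_time
    cb_speedup = CB_BASELINE / cb_time
    overall_speedup = (AC_TIME + CB_BASELINE) / total
    return total, cb_speedup, overall_speedup

for cb_time in (50, 20, 4, 1.7, 0.3):
    total, cb_speedup, overall = trip_speedups(cb_time)
    print(f"C-B={cb_time:>4}  total={total:>5.1f}  "
          f"C-B speedup={cb_speedup:>7.2f}  overall={overall:.2f}")
```

However fast the C-B leg becomes, the overall speedup stays below 70 / 20 = 3.5, because the A-C leg still takes 20.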
Amdahl’s law
$\text{Execution time}_{\text{new}}$ = execution time after some enhancement
$\text{Execution time}_{\text{old}}$ = execution time before any enhancement
$\text{Fraction}_{\text{enhanced}}$ = fraction of work using the enhancement
$\text{Speedup}_{\text{enhanced}}$ = speedup of enhanced mode
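These quantities combine in the usual textbook statement of Amdahl's law, given here as a reference form:

$$
\text{Execution time}_{\text{new}} = \text{Execution time}_{\text{old}} \times \left[ (1 - \text{Fraction}_{\text{enhanced}}) + \frac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}} \right]
$$

$$
\text{Speedup}_{\text{overall}} = \frac{\text{Execution time}_{\text{old}}}{\text{Execution time}_{\text{new}}} = \frac{1}{(1 - \text{Fraction}_{\text{enhanced}}) + \dfrac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}}}
$$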
Example – I
We are considering an enhancement to
the processor of a web server. The new
CPU is 20 times faster on search queries
than the old processor. The old
processor is busy with search queries
70% of the time. What is the speedup
gained by integrating the enhanced
CPU?
Example – I Solution
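A worked solution, taking the 70% of time spent on search queries as the enhanced fraction and the factor of 20 as the speedup of the enhanced mode:

$$
\text{Speedup}_{\text{overall}} = \frac{1}{(1 - 0.7) + \dfrac{0.7}{20}} = \frac{1}{0.3 + 0.035} = \frac{1}{0.335} \approx 2.99
$$

So a CPU that is 20 times faster on search queries makes the server roughly 3 times faster overall.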
Example – II
Suppose that we are considering an
enhancement to the processor of a server
system used for web serving. The new CPU
is 10 times faster on computation in the Web
serving application than the original
processor. Assuming that the original CPU
is busy with computation 40% of the time
and is waiting for I/O 60% of the time, what
is the overall speedup gained by
incorporating the enhancement?
Example – II Solution
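A minimal sketch of the calculation, assuming that only the 40% of time spent computing benefits from the new CPU and that the I/O wait is unchanged. The helper name `amdahl_speedup` is illustrative, not taken from the slides.

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup when a fraction of execution time is sped up."""
    if not 0.0 <= fraction_enhanced <= 1.0:
        raise ValueError("fraction_enhanced must lie in [0, 1]")
    if speedup_enhanced <= 0:
        raise ValueError("speedup_enhanced must be positive")
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)

# Computation is 40% of the time and becomes 10 times faster.
overall = amdahl_speedup(fraction_enhanced=0.40, speedup_enhanced=10)
print(f"Overall speedup: {overall:.2f}")
```

The result is 1 / (0.6 + 0.04) = 1 / 0.64, or about 1.56: the I/O wait dominates, so a 10 times faster CPU yields well under a 2 times gain.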
Amdahl’s law Example
Consider an operation that takes 20 ns on a
machine with an enhancement and 100 ns on a
machine without it. Assume the enhancement can
be used only 30% of the time.
What is the overall speedup?
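A worked solution: the speedup of the enhanced mode follows from the two timings, and the 30% figure is taken as the enhanced fraction.

$$
\text{Speedup}_{\text{enhanced}} = \frac{100\ \text{ns}}{20\ \text{ns}} = 5
$$

$$
\text{Speedup}_{\text{overall}} = \frac{1}{(1 - 0.3) + \dfrac{0.3}{5}} = \frac{1}{0.7 + 0.06} = \frac{1}{0.76} \approx 1.32
$$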
Corollary
If an enhancement is only usable for a fraction of a
task, we can’t speed up the task by more than the
reciprocal of 1 minus that fraction.
$$
\text{Performance improvement limit} = \frac{1}{1 - \text{Fraction}_{\text{enhanced}}}
$$
Example
Frequency of FP instructions : 25%
Average CPI of FP instructions : 4.0
Average CPI of other instructions : 1.33
Frequency of FPSQR = 2%
CPI of FPSQR = 20
Design Alternative 1: Reduce CPI of FPSQR
from 20 to 2.
Design Alternative 2: Reduce average CPI of all
FP instructions to 2.5
Compare these two design alternatives using
CPU Performance equation.
Solution
Original CPI = 0.25*4 + 1.33*(1-0.25) = 1.9975 ≈ 2.0
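The comparison can be completed with the CPU performance equation. The sketch below assumes that instruction count and clock rate are the same for both alternatives, so the speedup is just the ratio of CPIs, and that FPSQR instructions are counted within the 25% of FP instructions. Variable names are illustrative.

```python
FP_FREQ, FP_CPI = 0.25, 4.0          # all FP instructions
OTHER_CPI = 1.33                     # non-FP instructions
FPSQR_FREQ, FPSQR_CPI = 0.02, 20.0   # FP square root only

cpi_original = FP_FREQ * FP_CPI + (1 - FP_FREQ) * OTHER_CPI

# Alternative 1: FPSQR CPI drops from 20 to 2; only 2% of instructions change.
cpi_alt1 = cpi_original - FPSQR_FREQ * (FPSQR_CPI - 2.0)

# Alternative 2: average CPI of all FP instructions drops to 2.5.
cpi_alt2 = FP_FREQ * 2.5 + (1 - FP_FREQ) * OTHER_CPI

speedup_alt1 = cpi_original / cpi_alt1
speedup_alt2 = cpi_original / cpi_alt2

print(f"original CPI = {cpi_original:.4f}")
print(f"alternative 1: CPI = {cpi_alt1:.4f}, speedup = {speedup_alt1:.3f}")
print(f"alternative 2: CPI = {cpi_alt2:.4f}, speedup = {speedup_alt2:.3f}")
```

Alternative 1 gives a CPI of about 1.64 and Alternative 2 about 1.62, so improving all FP instructions is marginally better (a speedup of roughly 1.23 versus 1.22).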
Example – III
A common transformation required in graphics engines is
square root. Implementations of floating-point (FP) square
root vary significantly in performance, especially among
processors designed for graphics. Suppose FP square root
(FPSQR) is responsible for 20% of the execution time of a
critical graphics benchmark. One proposal is to enhance
the FPSQR hardware and speed up this operation by a
factor of 10. The other alternative is just to try to make all
FP instructions in the graphics processor run faster by a
factor of 1.6; FP instructions are responsible for a total of
50% of the execution time for the application. The design
team believes that they can make all FP instructions run 1.6
times faster with the same effort as required for the fast
square root. Compare these two design alternatives.
Solution – III
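A worked comparison using Amdahl's law, with the fractions of execution time (20% for FPSQR, 50% for all FP) taken as the enhanced fractions:

$$
\text{Speedup}_{\text{FPSQR}} = \frac{1}{(1 - 0.2) + \dfrac{0.2}{10}} = \frac{1}{0.82} \approx 1.22
$$

$$
\text{Speedup}_{\text{FP}} = \frac{1}{(1 - 0.5) + \dfrac{0.5}{1.6}} = \frac{1}{0.8125} \approx 1.23
$$

Improving all FP instructions is slightly better, because FP work accounts for a larger share of the execution time.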
Exercise
Clock freq = 1.4 GHz
FP instructions = 25%
Average CPI of FP instructions = 4.0
Average CPI of other instructions = 1.33
FPSQRT = 2%, CPI of FPSQRT = 20
Design Option 1: decrease the CPI of FPSQRT to 2, clock
freq = 1.2 GHz
Design Option 2: decrease the average CPI of all FP
instructions to 2.5, clock freq = 1.1 GHz
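One possible approach, assuming the exercise asks which option gives the lower execution time and that the instruction count is the same in all three cases. Under that assumption, execution time is proportional to CPI divided by clock frequency. The function name `ns_per_instruction` is hypothetical.

```python
def ns_per_instruction(cpi, clock_ghz):
    """Average time per instruction in nanoseconds (CPI / frequency in GHz)."""
    return cpi / clock_ghz

FP_FREQ, OTHER_CPI = 0.25, 1.33

cpi_original = FP_FREQ * 4.0 + (1 - FP_FREQ) * OTHER_CPI
cpi_option1 = cpi_original - 0.02 * (20.0 - 2.0)   # FPSQRT CPI 20 -> 2
cpi_option2 = FP_FREQ * 2.5 + (1 - FP_FREQ) * OTHER_CPI

t_original = ns_per_instruction(cpi_original, 1.4)
t_option1 = ns_per_instruction(cpi_option1, 1.2)
t_option2 = ns_per_instruction(cpi_option2, 1.1)

for name, t in (("original", t_original),
                ("option 1", t_option1),
                ("option 2", t_option2)):
    print(f"{name}: {t:.3f} ns/instruction, speedup = {t_original / t:.3f}")
```

With these numbers, Option 1 comes out slightly faster than the original design, while Option 2 comes out slower: its lower clock frequency outweighs its CPI improvement.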
Pitfall: MIPS as a Performance Metric
MIPS: Millions of Instructions Per Second
Doesn’t account for
Differences in ISAs between computers
Differences in complexity between instructions
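$$
\text{MIPS} = \frac{\text{Instruction count}}{\text{Execution time} \times 10^6}
= \frac{\text{Instruction count}}{\dfrac{\text{Instruction count} \times \text{CPI}}{\text{Clock rate}} \times 10^6}
= \frac{\text{Clock rate}}{\text{CPI} \times 10^6}
$$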
SPEC CPU2006
Elapsed time to execute a selection of programs
Negligible I/O, so focuses on CPU performance
Normalize relative to reference machine
Summarize as geometric mean of performance ratios
CINT2006 (integer) and CFP2006 (floating-point)
$$
\sqrt[n]{\prod_{i=1}^{n} \text{Execution time ratio}_i}
$$
CINT2006 for Opteron X4 2356
Name        Description                    IC (×10^9)  CPI    Tc (ns)  Exec time (s)  Ref time (s)  SPECratio
perl        Interpreted string processing  2,118       0.75   0.40     637            9,777         15.3
bzip2       Block-sorting compression      2,389       0.85   0.40     817            9,650         11.8
gcc         GNU C Compiler                 1,050       1.72   0.40     724            8,050         11.1
mcf         Combinatorial optimization     336         10.00  0.40     1,345          9,120         6.8
go          Go game (AI)                   1,658       1.09   0.40     721            10,490        14.6
hmmer       Search gene sequence           2,783       0.80   0.40     890            9,330         10.5
sjeng       Chess game (AI)                2,176       0.96   0.40     837            12,100        14.5
libquantum  Quantum computer simulation    1,623       1.61   0.40     1,047          20,720        19.8
h264avc     Video compression              3,102       0.80   0.40     993            22,130        22.3
omnetpp     Discrete event simulation      587         2.94   0.40     690            6,250         9.1
astar       Games/path finding             1,082       1.79   0.40     773            7,020         9.1
xalancbmk   XML parsing                    1,058       2.70   0.40     1,143          6,900         6.0
Geometric mean                                                                                      11.7
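The summary figure can be checked by taking the geometric mean of the SPECratio column. This is a small sketch; the dictionary `spec_ratios` simply restates the table, and `geometric_mean` is an illustrative helper computed through logarithms.

```python
import math

# SPECratio column of the CINT2006 table for the Opteron X4 2356.
spec_ratios = {
    "perl": 15.3, "bzip2": 11.8, "gcc": 11.1, "mcf": 6.8,
    "go": 14.6, "hmmer": 10.5, "sjeng": 14.5, "libquantum": 19.8,
    "h264avc": 22.3, "omnetpp": 9.1, "astar": 9.1, "xalancbmk": 6.0,
}

def geometric_mean(values):
    """n-th root of the product of n positive values, computed via logs."""
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))

gm = geometric_mean(spec_ratios.values())
print(f"Geometric mean of SPECratios: {gm:.1f}")
```

Each SPECratio is the reference time divided by the measured execution time, and the execution time itself is IC × CPI × Tc; for perl, 2,118 × 0.75 × 0.40 gives about 635 s, close to the 637 s listed.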
SPEC Benchmark
Desktop Benchmarks
CPU-intensive benchmarks
SPEC89
SPEC92
SPEC95
SPEC2000
SPEC2006
Graphics-intensive benchmarks
SPEC2000
SPECviewperf: used for benchmarking systems supporting the OpenGL graphics library
SPECapc: consists of applications that make extensive use of graphics
SPEC Benchmark
Server Benchmarks
SPECrate: processing rate of a multiprocessor
SPECSFS: file server benchmark
SPECWeb: Web server benchmark
Transaction-processing (TP) benchmarks
TPC benchmarks (Transaction Processing Council)
TPC-A, 1985
TPC-C, 1992
TPC-H, TPC-R, TPC-W
SPEC Benchmark
Embedded Benchmarks
EDN Embedded Microprocessor Benchmark
Consortium (or EEMBC, pronounced “embassy”).
Power Consumption Trends
Power = Dynamic power + Leakage power
• Dynamic power ∝ activity × capacitance × voltage² × frequency
• Capacitance per transistor and voltage are decreasing,
but the number of transistors and frequency are increasing at a faster rate
• Leakage power is also rising and will soon match dynamic
power
Power consumption is already around 100W in some
high-performance processors today
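The proportionality for dynamic power makes it easy to estimate relative changes. The sketch below uses hypothetical scaling factors (voltage and frequency each reduced to 85% of their baseline values) that are not from the slides; `relative_dynamic_power` is an illustrative name.

```python
def relative_dynamic_power(activity=1.0, capacitance=1.0,
                           voltage=1.0, frequency=1.0):
    """Dynamic power relative to a baseline, given each factor's scaling.

    Follows: dynamic power ∝ activity × capacitance × voltage² × frequency.
    """
    return activity * capacitance * voltage ** 2 * frequency

# Hypothetical case: voltage and frequency both scaled to 85% of baseline.
ratio = relative_dynamic_power(voltage=0.85, frequency=0.85)
print(f"Dynamic power relative to baseline: {ratio:.3f}")
```

Because voltage enters squared, lowering voltage and frequency together by 15% cuts dynamic power to about 61% of the baseline in this hypothetical case.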
Power wall
Fallacy: Low Power at Idle
Look back at X4 power benchmark
At 100% load: 295W
At 50% load: 246W (83%)
At 10% load: 180W (61%)
Google data center
Mostly operates at 10% – 50% load
At 100% load less than 1% of the time
Consider designing processors to make power
proportional to load
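The percentages in the X4 figures are the measured power divided by the power at full load. A short sketch comparing them with what an ideally load-proportional machine would draw (the names `PEAK_WATTS` and `measured_watts` are illustrative):

```python
PEAK_WATTS = 295  # power at 100% load

# Measured power (watts) at each load level, from the X4 power benchmark.
measured_watts = {1.00: 295, 0.50: 246, 0.10: 180}

for load, watts in measured_watts.items():
    fraction_of_peak = watts / PEAK_WATTS
    proportional_watts = load * PEAK_WATTS  # ideal: power tracks load
    print(f"load {load:>4.0%}: {watts} W = {fraction_of_peak:.0%} of peak "
          f"(load-proportional would be {proportional_watts:.1f} W)")
```

At 10% load the server still draws about 61% of its peak power, which is why power proportional to load is worth designing for.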
Performance Metrics
MIPS: Millions of Instructions Per Second
MFLOPS: Millions of floating point operations
per second.
http://ece-research.unm.edu/jimp/611/slides/chap1_3.html