06 CA (Performance Enhancement)

The document discusses performance enhancement in computer architecture. It explains that performance is determined by execution time, not just the number of instructions or cycles. It introduces Amdahl's law, which states that the overall speedup from an enhancement is limited by the fraction of time the original program spends running non-enhanced code. Several examples are provided to illustrate how to use Amdahl's law to evaluate potential performance improvements.

Uploaded by

Royal Stars

Performance Enhancement

CS353 – Computer Architecture

Najeeb-Ur-Rehman
Assistant Professor
Department of Computer Science
Faculty of Computing & IT
University of Gujrat
# of Instructions Example
A compiler designer is trying to decide between
two code sequences for a particular machine.
Based on the hardware implementation, there
are three different classes of instructions: Class
A, Class B, and Class C, and they require one,
two, and three cycles (respectively).
The first code sequence has 5 instructions: 2 of
A, 1 of B, and 2 of C. The second sequence has 6
instructions: 4 of A, 1 of B, and 1 of C.
Which sequence will be faster? How much?
What is the CPI for each sequence?
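A minimal Python sketch of how this example can be worked, assuming only the cycle counts and instruction mixes stated above (the names `CYCLES_PER_CLASS` and `cycles_and_cpi` are illustrative, not from the slides):

```python
# Cycles needed by each instruction class (from the problem statement).
CYCLES_PER_CLASS = {"A": 1, "B": 2, "C": 3}


def cycles_and_cpi(mix):
    """Return (total cycles, instruction count, CPI) for an instruction mix."""
    instructions = sum(mix.values())
    cycles = sum(count * CYCLES_PER_CLASS[cls] for cls, count in mix.items())
    return cycles, instructions, cycles / instructions


seq1 = {"A": 2, "B": 1, "C": 2}  # first sequence: 5 instructions
seq2 = {"A": 4, "B": 1, "C": 1}  # second sequence: 6 instructions

cycles1, n1, cpi1 = cycles_and_cpi(seq1)
cycles2, n2, cpi2 = cycles_and_cpi(seq2)

print(f"Sequence 1: {n1} instructions, {cycles1} cycles, CPI = {cpi1:.2f}")
print(f"Sequence 2: {n2} instructions, {cycles2} cycles, CPI = {cpi2:.2f}")
print(f"Sequence 2 is {cycles1 / cycles2:.2f}x faster")
```

Sequence 2 executes more instructions but needs fewer cycles (9 versus 10), so instruction count alone does not decide which sequence is faster.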
2
Performance
 Performance is determined by execution time
 Do any of the other variables equal performance?
 # of cycles to execute program?
 # of instructions in program?
 # of cycles per second?
 average # of cycles per instruction?
 average # of instructions per second?

 Common pitfall: thinking one of the variables is
indicative of performance when it really isn’t.

3
Quantitative Principles
 Make common case fast
 Favor the frequent case (simpler) over the infrequent
case.
 For example, given that overflow in addition is
infrequent, favor optimizing the case when no
overflow occurs.
 Objective
 Determine the frequent case.
 Determine how much improvement in performance is
possible by making it faster.

Amdahl's law can be used to quantify the latter
given that we have information concerning the
former.
4
Amdahl’s law

 The performance improvement to be gained from
using some faster mode of execution is limited by
the fraction of the time the faster mode can be used.
 This implies that the time consumed by events
whose performance is not improved limits the
effect of any improvement.
 Lowest performer restricts all others.

5
Amdahl’s law….
The parameter to use in measuring the
effect of Amdahl's Law is speedup:
$$\text{Speedup} = \frac{\text{Performance using enhancement}}{\text{Performance without using enhancement}}$$
or
$$\text{Speedup} = \frac{\text{Execution time without enhancement}}{\text{Execution time with enhancement}}$$

6
Speedup depends on two factors
 The fraction of the computation time in the original
machine that can be converted to take advantage of
the enhancement

$\text{Fraction}_{\text{enhanced}} \le 1$
 The improvement gained by the enhanced execution
mode
$\text{Speedup}_{\text{enhanced}} > 1$

7
Example
 Trip from point A to point B in two parts

A --(20)--> C --(50 / 20 / 4 / 1.7 / 0.3)--> B

A-C Trip   C-B Trip   Total Time   C-B Speedup   Overall Speedup
20         50         70           1             1
20         20         40           2.5           1.75
20         4          24           12.5          2.9
20         1.7        21.7         29.4          3.2
20         0.3        20.3         166.66        3.4
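The trip table can be reproduced with a short Python sketch. It assumes only the leg times shown above; the variable names are illustrative.

```python
A_TO_C = 20.0                                # first leg, never improved
C_TO_B_TIMES = [50.0, 20.0, 4.0, 1.7, 0.3]   # second leg, progressively faster

baseline_leg = C_TO_B_TIMES[0]
baseline_total = A_TO_C + baseline_leg

rows = []
for leg in C_TO_B_TIMES:
    total = A_TO_C + leg
    leg_speedup = baseline_leg / leg          # speedup of the improved part only
    overall_speedup = baseline_total / total  # speedup of the whole trip
    rows.append((leg, total, leg_speedup, overall_speedup))

for leg, total, leg_speedup, overall_speedup in rows:
    print(f"C-B = {leg:5.1f}  total = {total:5.1f}  "
          f"C-B speedup = {leg_speedup:7.2f}  overall = {overall_speedup:.2f}")
```

However fast the C-B leg becomes, the overall speedup stays below 70/20 = 3.5, because the A-C leg is untouched.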

8
Amdahl’s law
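$$\text{Exec time}_{\text{new}} = \text{Exec time}_{\text{old}} \times \left[ (1 - \text{Fraction}_{\text{enhanced}}) + \frac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}} \right]$$

$$\text{Speedup}_{\text{overall}} = \frac{\text{Exec time}_{\text{old}}}{\text{Exec time}_{\text{new}}} = \frac{1}{(1 - \text{Fraction}_{\text{enhanced}}) + \dfrac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}}}$$

Exec time_new = execution time after some enhancement
Exec time_old = execution time before any enhancement
Fraction_enhanced = fraction of work using the enhancement
Speedup_enhanced = speedup of enhanced mode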
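A minimal Python sketch of the overall speedup calculation built from these four quantities. The function name `amdahl_speedup` and the sample inputs are illustrative assumptions, not part of the slides.

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup = Exec time_old / Exec time_new.

    Exec time_new = Exec time_old * ((1 - F) + F / S), where
    F = fraction_enhanced and S = speedup_enhanced.
    """
    if not 0.0 <= fraction_enhanced <= 1.0:
        raise ValueError("fraction_enhanced must lie in [0, 1]")
    if speedup_enhanced <= 0.0:
        raise ValueError("speedup_enhanced must be positive")
    new_time_ratio = (1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced
    return 1.0 / new_time_ratio


# Hypothetical inputs: half of the work runs twice as fast.
print(f"{amdahl_speedup(0.5, 2.0):.3f}")
```

With no enhanced fraction the function returns 1 (no speedup), and with the whole program enhanced it returns the enhanced-mode speedup itself.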

9
Conti…

10
Example – I
 We are considering an enhancement to
the processor of a web server. The new
CPU is 20 times faster on search queries
than the old processor. The old
processor is busy with search queries
70% of the time. What is the speedup
gained by integrating the enhanced
CPU?
11
Example – I Solution
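Plugging the Example I numbers into Amdahl's law can be sketched in Python as follows. Only the 70% fraction and the 20x enhanced speedup come from the problem statement; the variable names are illustrative.

```python
fraction_enhanced = 0.70   # share of time spent on search queries
speedup_enhanced = 20.0    # new CPU is 20 times faster on search queries

# New execution time as a fraction of the old execution time.
new_time_ratio = (1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced
overall_speedup = 1.0 / new_time_ratio

print(f"New time = {new_time_ratio:.3f} x old time")
print(f"Overall speedup = {overall_speedup:.2f}")
```

The new time is 0.335 of the old time, so the overall speedup is about 2.99, far below the 20x gain on search queries alone.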

12
Example – II
 Suppose that we are considering an
enhancement to the processor of a server
system used for web serving. The new CPU
is 10 times faster on computation in the Web
serving application than the original
processor. Assuming that the original CPU
is busy with computation 40% of the time
and is waiting for I/O 60% of the time, what
is the overall speedup gained by
incorporating the enhancement?
13
Example – II Solution
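One way to work Example II is to account for time directly, assuming an arbitrary old execution time of 100 units (a choice made here for illustration; any value gives the same ratio):

```python
old_total = 100.0                 # arbitrary time units (assumption)
old_compute = 0.40 * old_total    # CPU busy with computation
old_io_wait = 0.60 * old_total    # waiting for I/O, not improved

new_compute = old_compute / 10.0  # computation runs 10 times faster
new_total = new_compute + old_io_wait

overall_speedup = old_total / new_total
print(f"New total time = {new_total:.1f} units")
print(f"Overall speedup = {overall_speedup:.4f}")
```

The I/O wait dominates the new total (60 of 64 units), so the overall speedup is only about 1.56.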

14
Amdahl’s law Example
 Consider an operation that takes 20 ns on a
machine with an enhancement and 100 ns on a
machine without it. Assume the enhancement can only
be used 30% of the time.
 What is the overall speedup?
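A short Python sketch of this calculation, assuming the 30% figure is the fraction of the original execution time that can use the enhancement:

```python
time_without_ns = 100.0
time_with_ns = 20.0
fraction_enhanced = 0.30   # assumed to be a fraction of the original time

# Speedup of the enhanced mode itself.
speedup_enhanced = time_without_ns / time_with_ns

# Amdahl's law for the whole program.
overall_speedup = 1.0 / ((1.0 - fraction_enhanced)
                         + fraction_enhanced / speedup_enhanced)

print(f"Speedup of enhanced mode = {speedup_enhanced:.1f}")
print(f"Overall speedup = {overall_speedup:.3f}")
```

The enhanced mode is 5 times faster, but the overall speedup is only about 1.32.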

15
Corollary
 If an enhancement is only usable for a fraction of a
task, we can’t speed up the task by more than the
reciprocal of 1 minus that fraction.

$$\text{Performance Improvement Limit} = \frac{1}{1 - \text{Fraction}_{\text{enhanced}}}$$
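The limit can be illustrated numerically. The sketch below reuses a 30% enhanced fraction and sweeps a few enhanced-mode speedups chosen for illustration:

```python
def overall_speedup(fraction_enhanced, speedup_enhanced):
    """Amdahl's law: overall speedup for a given enhanced fraction."""
    return 1.0 / ((1.0 - fraction_enhanced)
                  + fraction_enhanced / speedup_enhanced)


FRACTION_ENHANCED = 0.30                  # illustrative value
limit = 1.0 / (1.0 - FRACTION_ENHANCED)   # reciprocal of (1 - fraction)

# Enhanced-mode speedups to try (hypothetical values).
speedups = {s: overall_speedup(FRACTION_ENHANCED, s)
            for s in (2, 5, 10, 100, 1_000_000)}

for s, overall in speedups.items():
    print(f"Speedup_enhanced = {s:>9}: overall = {overall:.4f}")
print(f"Limit = {limit:.4f}")
```

Even a million-fold enhancement leaves the overall speedup just under the limit of about 1.43.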

16
Example
 Frequency of FP instructions : 25%
 Average CPI of FP instructions : 4.0
 Average CPI of other instructions : 1.33
 Frequency of FPSQR = 2%
 CPI of FPSQR = 20
 Design Alternative 1: Reduce CPI of FPSQR
from 20 to 2.
 Design Alternative 2: Reduce average CPI of all
FP instructions to 2.5
 Compare these two design alternatives using
the CPU performance equation.

17
Solution
Original CPI = 0.25*4 + 1.33*(1-0.25) ≈ 2.0

Option 1 CPI = 2.0 - 2%*(20-2) = 1.64

Option 2 CPI = 0.25*2.5 + 1.33*(1-0.25) ≈ 1.625

Speedup of Option 1 = 2/1.64 = 1.2195

Speedup of Option 2 = 2/1.625 = 1.2308
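The same comparison can be sketched in Python. It assumes, as the solution does, that instruction count and clock rate are unchanged, so the speedup is the ratio of CPIs. The code keeps the unrounded CPI of 1.9975, so its results differ from the rounded figures above in the third or fourth decimal place.

```python
FP_FREQ = 0.25        # frequency of FP instructions
FP_CPI = 4.0          # average CPI of FP instructions
OTHER_CPI = 1.33      # average CPI of other instructions
FPSQR_FREQ = 0.02     # frequency of FPSQR
FPSQR_CPI = 20.0      # CPI of FPSQR

cpi_original = FP_FREQ * FP_CPI + (1.0 - FP_FREQ) * OTHER_CPI

# Option 1: FPSQR CPI drops from 20 to 2, saving 18 cycles on 2% of instructions.
cpi_option1 = cpi_original - FPSQR_FREQ * (FPSQR_CPI - 2.0)

# Option 2: average CPI of all FP instructions drops to 2.5.
cpi_option2 = FP_FREQ * 2.5 + (1.0 - FP_FREQ) * OTHER_CPI

speedup1 = cpi_original / cpi_option1
speedup2 = cpi_original / cpi_option2

print(f"Original CPI = {cpi_original:.4f}")
print(f"Option 1: CPI = {cpi_option1:.4f}, speedup = {speedup1:.4f}")
print(f"Option 2: CPI = {cpi_option2:.4f}, speedup = {speedup2:.4f}")
```

Either way, Option 2 comes out slightly ahead of Option 1.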

18
Example – III
A common transformation required in graphics engines is
square root. Implementations of floating-point (FP) square
root vary significantly in performance, especially among
processors designed for graphics. Suppose FP square root
(FPSQR) is responsible for 20% of the execution time of a
critical graphics benchmark. One proposal is to enhance
the FPSQR hardware and speed up this operation by a
factor of 10. The other alternative is just to try to make all
FP instructions in the graphics processor run faster by a
factor of 1.6; FP instructions are responsible for a total of
50% of the execution time for the application. The design
team believes that they can make all FP instructions run 1.6
times faster with the same effort as required for the fast
square root. Compare these two design alternatives.

19
Solution – III
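Applying Amdahl's law to the two alternatives in Example III can be sketched as follows; the fractions and speedups are those given in the problem statement, and the function name is illustrative.

```python
def overall_speedup(fraction_enhanced, speedup_enhanced):
    """Amdahl's law for a single enhancement."""
    return 1.0 / ((1.0 - fraction_enhanced)
                  + fraction_enhanced / speedup_enhanced)


# Alternative 1: FPSQR is 20% of execution time, made 10 times faster.
speedup_fpsqr = overall_speedup(0.20, 10.0)

# Alternative 2: all FP is 50% of execution time, made 1.6 times faster.
speedup_fp = overall_speedup(0.50, 1.6)

print(f"Speedup with faster FPSQR  = {speedup_fpsqr:.4f}")
print(f"Speedup with faster all FP = {speedup_fp:.4f}")
```

Improving all FP instructions gives about 1.23 against about 1.22 for the faster square root, because FP instructions cover a larger share of the execution time.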

20
Exercise
Clock freq = 1.4 GHz
FP instructions = 25%
Average CPI of FP instructions = 4.0
Average CPI of other instructions = 1.33
FPSQRT = 2%, CPI of FPSQRT = 20
 Design Option 1: decrease the CPI of FPSQRT to 2, clock
freq = 1.2 GHz
 Design Option 2: decrease the average CPI of all FP
instructions to 2.5, clock freq = 1.1 GHz
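One possible way to approach the exercise is sketched below. It assumes the goal is to compare the design options by average time per instruction (CPI divided by clock rate), that the instruction count is the same for all designs, and that FPSQRT is counted inside the 25% FP instructions as in the earlier example. None of these assumptions is stated on the slide.

```python
BASE_CPI = 0.25 * 4.0 + 0.75 * 1.33   # original average CPI

# (average CPI, clock rate in Hz) for each design
designs = {
    "original": (BASE_CPI, 1.4e9),
    "option 1": (BASE_CPI - 0.02 * (20.0 - 2.0), 1.2e9),
    "option 2": (0.25 * 2.5 + 0.75 * 1.33, 1.1e9),
}

# Average time per instruction in nanoseconds: CPI / clock rate.
time_per_instr_ns = {name: cpi / clock_hz * 1e9
                     for name, (cpi, clock_hz) in designs.items()}

for name, t in time_per_instr_ns.items():
    speedup = time_per_instr_ns["original"] / t
    print(f"{name}: {t:.4f} ns per instruction, "
          f"speedup vs original = {speedup:.4f}")
```

Under these assumptions the lower clock rates eat into the CPI gains: Option 1 stays slightly faster than the original, while Option 2 ends up slower.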

21
Pitfall: MIPS as a Performance Metric
 MIPS: Millions of Instructions Per Second
 Doesn’t account for
 Differences in ISAs between computers
 Differences in complexity between instructions

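$$\text{MIPS} = \frac{\text{Instruction count}}{\text{Execution time} \times 10^6} = \frac{\text{Instruction count}}{\dfrac{\text{Instruction count} \times \text{CPI}}{\text{Clock rate}} \times 10^6} = \frac{\text{Clock rate}}{\text{CPI} \times 10^6}$$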

 CPI varies between programs on a given CPU
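A small Python sketch of why this matters, using the CPI values listed for perl and mcf in the CINT2006 table and a 0.40 ns clock cycle (2.5 GHz). The function name `mips` is illustrative.

```python
CLOCK_RATE_HZ = 2.5e9   # 1 / 0.40 ns clock cycle time


def mips(clock_rate_hz, cpi):
    """MIPS = clock rate / (CPI * 10^6)."""
    return clock_rate_hz / (cpi * 1e6)


# CPI values for two programs on the same CPU.
cpi_by_program = {"perl": 0.75, "mcf": 10.0}

mips_by_program = {name: mips(CLOCK_RATE_HZ, cpi)
                   for name, cpi in cpi_by_program.items()}

for name, rating in mips_by_program.items():
    print(f"{name}: {rating:.0f} MIPS")
```

The same processor rates roughly 3,333 MIPS on perl and 250 MIPS on mcf, so a single MIPS figure says little about the machine.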


22
SPEC CPU Benchmark
 Programs used to measure performance
 Supposedly typical of actual workload
 Standard Performance Evaluation Corp (SPEC)
 Develops benchmarks for CPU, I/O, Web, …

 SPEC CPU2006
 Elapsed time to execute a selection of programs
 Negligible I/O, so focuses on CPU performance
 Normalize relative to reference machine
 Summarize as geometric mean of performance ratios
 CINT2006 (integer) and CFP2006 (floating-point)

$$\sqrt[n]{\prod_{i=1}^{n} \text{Execution time ratio}_i}$$

23
CINT2006 for Opteron X4 2356
Name        Description                     IC×10^9   CPI    Tc (ns)  Exec time (s)  Ref time (s)  SPECratio
perl        Interpreted string processing   2,118     0.75   0.40     637            9,777         15.3
bzip2       Block-sorting compression       2,389     0.85   0.40     817            9,650         11.8
gcc         GNU C Compiler                  1,050     1.72   0.40     724            8,050         11.1
mcf         Combinatorial optimization      336       10.00  0.40     1,345          9,120         6.8
go          Go game (AI)                    1,658     1.09   0.40     721            10,490        14.6
hmmer       Search gene sequence            2,783     0.80   0.40     890            9,330         10.5
sjeng       Chess game (AI)                 2,176     0.96   0.40     837            12,100        14.5
libquantum  Quantum computer simulation     1,623     1.61   0.40     1,047          20,720        19.8
h264avc     Video compression               3,102     0.80   0.40     993            22,130        22.3
omnetpp     Discrete event simulation       587       2.94   0.40     690            6,250         9.1
astar       Games/path finding              1,082     1.79   0.40     773            7,020         9.1
xalancbmk   XML parsing                     1,058     2.70   0.40     1,143          6,900         6.0
Geometric mean                                                                                     11.7
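The summary figure can be checked with a short Python sketch that takes the SPECratio column from the table and computes its geometric mean. Averaging logarithms is a choice made here; it is equivalent to taking the n-th root of the product.

```python
import math

# SPECratio column of the CINT2006 table, in row order.
spec_ratios = [15.3, 11.8, 11.1, 6.8, 14.6, 10.5,
               14.5, 19.8, 22.3, 9.1, 9.1, 6.0]


def geometric_mean(values):
    """n-th root of the product of n positive values."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


geomean = geometric_mean(spec_ratios)
print(f"Geometric mean of SPECratios = {geomean:.1f}")

# SPECratio itself is reference time divided by measured execution time,
# shown here for the perl row.
perl_ratio = 9777 / 637
print(f"perl SPECratio = {perl_ratio:.1f}")
```

The result rounds to the 11.7 reported in the last row of the table.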

High cache miss rates

24
SPEC Benchmark
 Desktop Benchmarks
 CPU-intensive benchmarks
 SPEC89
 SPEC92
 SPEC95
 SPEC2000
 SPEC2006
 graphics-intensive benchmarks
 SPEC2000
 SPECviewperf
o is used for benchmarking systems supporting the OpenGL graphics library
 SPECapc
o consists of applications that make extensive use of graphics.

25
SPEC Benchmark
 Server Benchmarks
 SPECrate--processing rate of a multiprocessor
 (SPECSFS)--file server benchmark
 (SPECWeb)--Web server benchmark
 Transaction-processing (TP) benchmarks
 TPC benchmark—Transaction Processing Council
 TPC-A, 1985
 TPC-C, 1992
 TPC-H, TPC-R, TPC-W

26
SPEC Benchmark
 Embedded Benchmarks
 EDN Embedded Microprocessor Benchmark
Consortium (or EEMBC, pronounced “embassy”).

27
Power Consumption Trends
 Power = Dynamic power + Leakage power
• $\text{Dynamic power} \propto \text{activity} \times \text{capacitance} \times \text{voltage}^2 \times \text{frequency}$
• Capacitance per transistor and voltage are decreasing,
 but number of transistors and frequency are increasing at a faster rate
• Leakage power is also rising and will soon match dynamic
 power
 Power consumption is already around 100W in some
high-performance processors today

28
Power wall

 $\text{Power} = K \cdot (\text{Capacitive load}) \cdot (\text{Voltage})^2 \cdot (\text{Frequency switched})$
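Because K cancels when two designs are compared, the relation is easiest to use as a ratio. The Python sketch below uses hypothetical scaling factors (capacitive load, voltage and frequency each reduced to 85% of the old design); they are not taken from the slides.

```python
def relative_dynamic_power(capacitance_scale, voltage_scale, frequency_scale):
    """Power_new / Power_old for Power = K * C * V^2 * f (K cancels)."""
    return capacitance_scale * voltage_scale ** 2 * frequency_scale


# Hypothetical new design: load, voltage and frequency all at 85% of the old one.
ratio = relative_dynamic_power(0.85, 0.85, 0.85)
print(f"New power = {ratio:.2f} x old power")
```

Voltage enters squared, so the combined effect is 0.85 to the fourth power, about 0.52 of the original power.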

29
Fallacy: Low Power at Idle
 Look back at X4 power benchmark
 At 100% load: 295W
 At 50% load: 246W (83%)
 At 10% load: 180W (61%)
 Google data center
 Mostly operates at 10% – 50% load
 At 100% load less than 1% of the time
 Consider designing processors to make power
proportional to load
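The percentages above follow directly from the measured wattages. The sketch below also prints the power an ideally load-proportional system would draw, which is an added comparison for illustration:

```python
PEAK_WATTS = 295.0

# Measured power at each load level (percent load -> watts).
measured_watts = {100: 295.0, 50: 246.0, 10: 180.0}

# Fraction of peak power actually drawn at each load.
power_fraction = {load: watts / PEAK_WATTS
                  for load, watts in measured_watts.items()}

for load, fraction in power_fraction.items():
    ideal_watts = PEAK_WATTS * load / 100.0   # power proportional to load
    print(f"{load:3d}% load: {measured_watts[load]:.0f} W "
          f"({fraction:.0%} of peak), proportional ideal = {ideal_watts:.1f} W")
```

At 10% load the system still draws about 61% of its peak power, where a load-proportional design would draw 29.5 W.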

30
Performance Metrics
 MIPS: Millions of Instructions Per Second
 MFLOPS: Millions of floating point operations
per second.

 Topic 4.5 From Book

 http://ece-research.unm.edu/jimp/611/slides/chap1_3.html

31
