Advanced Computer Architecture

The document discusses advanced computer architecture and ways to improve computer performance. It covers topics like RISC instruction sets, pipelining, instruction-level parallelism, cache performance optimization, and exploiting parallelism at various levels. It discusses quantitative measures for comparing architectural ideas and computer performance, such as execution time, throughput, and benchmarks. Amdahl's law is introduced as a way to calculate expected speedup from architectural enhancements.


Advanced Computer Architecture

• We will consider issues in current architecture design and implementation:
– RISC instruction sets
– Pipelining
– Instruction-level parallelism
– Block-level parallelism
– Thread-level parallelism
– Multiprocessors
– Improving cache performance
– Optimizing virtual memory usage
• In CSC 362, we focused on
– the roles of the components in the architecture
– the structure of the architecture (how things connect together)
• Here, we focus on
– using available technology to improve computer performance
– using quantitative measures to test architectural ideas
– using a RISC instruction set for examples
– discussing a variety of software and hardware techniques to provide optimization
– attempting to force as much parallelism out of the code as possible
Measuring Performance
• We might use one of the following terms to measure performance
– MIPS, MegaFLOPS
• neither of these terms tells us how the processor performs on the other type of
operation
– Clock speed (GHz rating)
• misleading as we will explore throughout the semester
– Execution time
• worthwhile on an unloaded system
– Throughput
• number of programs / unit time – useful for servers and large systems
– Wall-clock time
– CPU time, user CPU time, system CPU time
• CPU time = user CPU time + system CPU time
– System performance
• on an unloaded system
– note: CPU performance = 1 / execution time
• What does it mean that one computer is faster than another?
Meaning of Performance
• X is n times faster than Y means
– Exec time Y / Exec time X = n
– Perf X / Perf Y = n
• Example:
– if the throughput of X is 1.3 times higher than that of Y
• then X can execute 1.3 times as many tasks as Y in the same amount of time
• Example:
– X executes program p1 in .32 seconds
– Y executes program p1 in .38 seconds
– X is .38 / .32 = 1.19 times faster
• 19% faster
• To validly compare two computers’ performance, we must compare their
performance on the same program
• Additionally, computers may have better performances on
different programs
– e.g., C1 runs P1 faster than C2 but C2 runs P2 faster than C1
– we might use weighted averages or geometric means, as well as
distributions, to derive a single processor’s overall performance (see pages
34-37 if you are interested)
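The geometric mean mentioned above is easy to illustrate. A minimal Python sketch (the machines, programs, and times here are invented for illustration, not from the text):

```python
import math

# Hypothetical execution times (seconds) for two machines on two programs
times = {"C1": {"P1": 1.0, "P2": 1000.0},
         "C2": {"P1": 10.0, "P2": 100.0}}

def geometric_mean(values):
    """Geometric mean: the nth root of the product of n values."""
    return math.prod(values) ** (1.0 / len(values))

# Both machines come out equal under the geometric mean, even though
# each wins decisively on one of the two programs
for machine, progs in times.items():
    print(machine, round(geometric_mean(list(progs.values())), 2))
```

Here each machine is 10 times faster on one program, so the geometric means come out identical; a weighted arithmetic mean would instead reflect how often each program runs.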
Benchmarks
• Four levels of programs can be used to test performance
– Real programs
• e.g., C compiler, CAD tool
• programs with input, output, and options that the user selects
– Kernels
• remove key pieces of programs and test just those
– Toy benchmarks
• 10-100 lines of code, such as quicksort, whose performance is known in advance
– Synthetic benchmarks
• try to match the average frequency of operations to simulate larger programs
• Only real programs are used today
– the others have been discredited, since computer architects and compiler
writers will optimize systems to perform well on these specific
benchmarks/kernels
• A benchmark suite is a set of programs that test different performance metrics
– Example: test array capabilities, floating point operations, loops
– SPEC benchmark suites are commonly cited
– SPEC 96 is the most recent benchmark, see figure 1.13 on page 31
• Reporting benchmark results must include
– compiler settings and version
– input
– OS
– number/size of disks
• Results must be reproducible
Principles of Computer Design
• As computer architecture research has progressed,
several key design concepts have been identified
– The goal today is to further exploit each of these because they
provide a great deal of performance speedup
– We will examine these and use a quantitative approach to
identify the extent of the speedup
• Take advantage of parallelism
– Using multiple hardware components (ALU functional units, memory
modules, register ports, disk drives, etc.), we can attempt to execute
instructions and threads in parallel
• Principle of locality of reference
– Used to design memory systems so that we can attempt to keep in cache
the data and instructions that will most likely be referenced soon
• Focus on the common case
– As we see next, if we can achieve a small speedup for executing the
common case, it is better than achieving a large speedup for an
uncommon case
Amdahl’s Law
• In order to explore architectural improvements, we need a mechanism to
gauge the speedup of our improvements
• Amdahl’s Law allows us to compute the speedup that can be gained by
using a particular feature as follows
• Given an enhancement E
– Speedup = performance with E / performance without E
or
– Speedup = execution time without E / execution time with E
• This law uses two factors:
– Fraction of the computation time in the original machine that can be
converted to take advantage of the enhancement (F)
– Improvement gained by the enhanced execution mode, i.e., how much
faster the task runs if the enhanced mode is used for the entire
program (S)

Speedup = 1 / [(1 - F) + F / S]
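The formula above is simple enough to check in code. A minimal Python sketch (the function name is ours, not from the text):

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Amdahl's Law: overall speedup = 1 / ((1 - F) + F / S)."""
    return 1.0 / ((1.0 - fraction_enhanced)
                  + fraction_enhanced / speedup_enhanced)

# Sanity checks: F = 0 gives no speedup; F = 1 gives the full factor S
print(amdahl_speedup(0.0, 10))   # 1.0
print(amdahl_speedup(1.0, 10))   # 10.0
```

Note that the result is bounded above by 1 / (1 - F) no matter how large S grows, which is why enhancing the common case matters.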
Examples
• Example 1:
– Web server is to be enhanced
• new CPU is 10 times faster on computation than old CPU
• the original CPU spent 40% of its time processing and 60% of its time waiting
for I/O
– What will the speedup be?
• Fraction enhancement used = 40%
• Speedup in enhanced mode = 10
• Speedup = 1 / [(1 - .4) + .4/10] = 1.56
• Example 2:
– A benchmark consists of:
• 20% FP sqrt
• 50% FP operations (including sqrt)
• 50% other operations
– Enhancement options are:
• add FP sqrt hardware to speed up sqrt performance by a factor of 10
• enhance all FP operations by a factor of 1.6
– Speedup FP sqrt = 1/[(1-.2) + .2/10] = 1.22
– Speedup all FP = 1/[(1-.5) + .5/1.6] = 1.23
– The enhancement to support the common case is (slightly) better
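Both examples above can be verified with a few lines of Python (a sketch reproducing the slide arithmetic):

```python
def amdahl(f, s):
    # Amdahl's Law: speedup = 1 / ((1 - f) + f / s)
    return 1.0 / ((1.0 - f) + f / s)

# Example 1: web server, enhancement used 40% of the time, 10x faster CPU
print(round(amdahl(0.40, 10), 2))    # 1.56

# Example 2: FP sqrt hardware (20% of time, 10x) vs. all FP ops (50%, 1.6x)
print(round(amdahl(0.20, 10), 2))    # 1.22
print(round(amdahl(0.50, 1.6), 2))   # 1.23
```

The modest 1.6x enhancement applied to half the time edges out the 10x enhancement applied to a fifth of it.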
CPU Performance Formulae
• CPU time = CPU clock cycles * clock cycle time
– CPU clock cycles – the number of clock cycles that elapse during the
execution of the given program
– clock cycle time is the reciprocal of the clock rate – that is, how much
time elapses for one clock cycle, which gives us:
• CPU time = CPU clock cycles for prog / clock rate
• CPU time = IC * CPI * Clock cycle time
– IC - instruction count (number of instructions)
– CPI - clock cycles per instruction
– IC * CPI = CPU clock cycles
• CPI = CPU clock cycles / IC
• CPU time = ( CPIi * ICi) * clock cycle time
• Average CPI = (CPIi * ICi) / Total Instruction Count
– In the latter equation, CPIi and ICi are for each type of operation (for
instance, the CPI and number of adds, the CPI and number of loads, …)
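The weighted-CPI formulae above can be sketched directly in Python (the instruction mix and clock cycle time here are assumed values, not from the text):

```python
# mix maps instruction class -> (instruction count, CPI for that class)
mix = {"load":   (2_000_000, 2),
       "store":  (1_000_000, 2),
       "alu":    (5_000_000, 1),
       "branch": (2_000_000, 2)}
clock_cycle_time = 1e-9  # assumed 1 ns cycle (1 GHz clock)

total_cycles = sum(ic * cpi for ic, cpi in mix.values())  # sum of CPIi * ICi
total_ic = sum(ic for ic, _ in mix.values())
avg_cpi = total_cycles / total_ic
cpu_time = total_cycles * clock_cycle_time

print(avg_cpi)    # 1.5
print(cpu_time)   # 0.015 seconds
```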
Example
• Assume:
– frequency of FP operations (including sqrt) = 25% and
frequency of FP sqrt = 2%
– average CPI of FP operations = 4.0, CPI of FP sqrt = 20
– average CPI of other instructions = 1.33
– CPI original = 4 * 25% + 1.33 * 75% = 2.0
• Two alternatives:
– reduce CPI of FP sqrt to 2 or
– reduce average CPI of all FP ops (including sqrt) to 2.5
• CPI new FP sqrt = CPI original - 2% * (20 - 2) = 2.0 - 0.36 = 1.64
• CPI new FP = 75% * 1.33 + 25% * 2.5 = 1.625
– Speedup new FP sqrt = CPI original / CPI new FP sqrt = 2.0 / 1.64 = 1.22
– Speedup new FP = CPI original / CPI new FP = 2.0 / 1.625 = 1.23
– again, enhancing the more common case is (slightly) better
Computing Speedup – which formula?
• We can compute speedup by
– determining the difference in CPU time before and after an enhancement
– or by using Amdahl’s Law
• Which should we use?
– the two formulas give the same answer
– let’s demonstrate this with an example:
• Benchmark consists of 35% loads, 15% stores, 40% ALU
operations and 10% branches
– CPI for each instruction is 5 for loads and stores and 4 for ALU and
branches (since this is an integer benchmark, the floating point registers
are not used)
– Consider that we could keep more values in registers by moving them to
floating point registers rather than storing and then reloading these values
in memory
• Let’s have the compiler replace some of the loads/stores with
register moves
– this enhancement is done by the compiler, so costs us nothing!
– assuming that the compiler can reduce 20% of the loads from the
program, how worthwhile is it?
Solution
• We change some loads/stores to ALU operations
– so overall CPI goes down, IC remains the same
• Solution 1: compute CPU Time differences
– CPU Time = IC * CPI * clock cycle time
– CPIold = 50% * 5 + 50% * 4 = 4.5
– CPInew = 40% * 5 + 60% * 4 = 4.4
– Since IC and clock cycle time have not changed, speedup is just CPIold /
CPInew
– Speedup = 4.5 / 4.4 = 1.0227, a 2.27% speedup
• Solution 2: Amdahl’s Law
– Speedup of enhanced mode is from 5 cycles to 4 cycles or 5/4 = 1.25
– Fraction used = fraction of the execution time where we use conversions
instead of loads/stores
• overall CPI is 4.5
• enhancement used on 20% of loads/stores
• 20% * 50% * 5 = .5 clock cycles out of 4.5, or .5 / 4.5 = 11.1% of the time
– Amdahl’s Law = 1 / [1 – F + F / S] = 1 / [1 - .111 + .111 / 1.25] =
1 / .9778 = 1.0227 = 2.27% speedup
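Both routes to the answer can be reproduced in a few lines of Python (a sketch of the slide arithmetic):

```python
# Method 1: CPI ratio (IC and clock cycle time are unchanged)
cpi_old = 0.50 * 5 + 0.50 * 4           # loads/stores at CPI 5, ALU/branch at 4
cpi_new = 0.40 * 5 + 0.60 * 4           # 20% of loads/stores became ALU ops
print(round(cpi_old / cpi_new, 4))      # 1.0227

# Method 2: Amdahl's Law
f = (0.20 * 0.50 * 5) / cpi_old         # fraction of time affected ~= 0.111
s = 5 / 4                               # enhanced mode: 5 cycles -> 4 cycles
print(round(1 / ((1 - f) + f / s), 4))  # 1.0227
```

Both methods agree because they are algebraic rearrangements of the same execution-time ratio.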
Why MIPS Can Be Misleading
• Assume a load-store machine with a breakdown of
– 43% ALU operations, CPI = 1
– 21% loads, 12% stores, 24% branches, CPI = 2 for each (57% of
instructions in total)
– an optimizing compiler is able to discard 50% of the ALU operations
• Ignoring system issues, the machine has a 2 nanosecond clock cycle
(500 MHz) and an unoptimized CPI of 1.57
– what is the MIPS rating for the optimized and unoptimized versions?
does the MIPS value agree with the execution time?
• MIPS = IC / (Execution Time * 10^6)
– exec time = IC * CPI / clock rate
– so, MIPS = clock rate / (CPI * 10^6)
• CPI unoptimized = .43 * 1 + .57 * 2 = 1.57
• MIPS unoptimized = (500 * 10^6) / (1.57 * 10^6) = 318.5
• CPI optimized = (.43 / 2 * 1 + .57 * 2) / (1 - .43 / 2) = 1.73
• MIPS optimized = (500 * 10^6) / (1.73 * 10^6) = 289.0
– The optimized program will execute faster because it has fewer
instructions, but its CPI is larger because a greater portion of the
remaining instructions have a higher CPI, and therefore its MIPS rating
is lower
• So, MIPS and execution time are not directly related!
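The MIPS-versus-time mismatch above is easy to reproduce (a Python sketch of the slide arithmetic; the instruction count is an assumed value):

```python
clock_rate = 500e6                      # 500 MHz

cpi_unopt = 0.43 * 1 + 0.57 * 2         # 1.57
cpi_opt = (0.43 / 2 * 1 + 0.57 * 2) / (1 - 0.43 / 2)   # ~1.73

mips_unopt = clock_rate / (cpi_unopt * 1e6)
mips_opt = clock_rate / (cpi_opt * 1e6)
print(round(mips_unopt, 1), round(mips_opt, 1))
# 318.5 289.7 (the slides get 289.0 from the pre-rounded CPI of 1.73)

# Execution time: the optimized program runs FASTER despite lower MIPS,
# because its instruction count shrank to 78.5% of the original
ic = 1e6                                # assumed original instruction count
t_unopt = ic * cpi_unopt / clock_rate
t_opt = (1 - 0.43 / 2) * ic * cpi_opt / clock_rate
print(t_opt < t_unopt)                  # True
```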
Sample Problem #1
• Consider adding register-memory ALU instructions to a machine
that previously only permitted register-register ALU operations
• Assume a benchmark with the following breakdown of
operations is used to test this enhancement:
– ALU operations: 43%, CPI = 1
– Loads: 21%, CPI = 2
– Stores: 12%, CPI = 2
– Branches: 24%, CPI = 2
• The new ALU register-memory operation has the following
consequences:
– ALU register-memory operations have CPI = 2 and Branches now
have a CPI = 3
• But, 25% of data loaded are only used once so that the new ALU
register-memory instruction can be used in place of the load +
ALU operation
• Is it worth it?
Solution
• CPIold = .43 * 1 + .57 * 2 = 1.57
• 3 changes:
– some ALU operations use the new mode, which changes their CPI
– fewer loads
– all branches have a higher CPI
• We have a new distribution:
– 25% of the ALU operations become ALU-memory operations
• 25% * 43% = 11%, so we remove this many loads, giving us 89% as
many instructions as previously
– Loads: [21% - (25% * 43%)] / 89% = 11%
– Stores: 12% / 89% = 13%
– ALU operations: 43% / 89% = 48%
– Branches: 24% / 89% = 27%
• CPInew = .11 * 2 + .13 * 2 + .27 * 3 + .48 * (.25 * 2 + .75 * 1) = 1.89
• CPU Time = IC * CPI * clock cycle time
– clock cycle time remains unchanged
– CPI has been recomputed
– IC in the new system is 89% of the old system
– CPUold = IC * 1.57 * CCT
– CPUnew = .89 * IC * 1.89 * CCT
– Speedup = 1.57 / (.89 * 1.89) = .934
• this is a slowdown, so this enhancement is not an improvement!
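The solution above can be reproduced in Python (a sketch using the slides' rounded distribution):

```python
cpi_old = 0.43 * 1 + 0.57 * 2                      # 1.57

# New mix after 25% of ALU ops absorb their load (IC shrinks to 89%)
ic_ratio = 0.89
loads, stores, alu, branch = 0.11, 0.13, 0.48, 0.27
cpi_new = (loads * 2 + stores * 2 + branch * 3
           + alu * (0.25 * 2 + 0.75 * 1))          # reg-mem CPI 2, reg-reg CPI 1
print(round(cpi_new, 2))                           # 1.89

speedup = cpi_old / (ic_ratio * cpi_new)
print(round(speedup, 3))                           # 0.933, a slowdown
```

Carrying the unrounded percentages through gives a slightly different value (about 0.92), but the conclusion is the same: the enhancement is a net loss.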
Sample Problem #2
• Assume a machine with a perfect cache and the following instruction mix:
– ALU: 43%, CPI 1
– Loads: 21%, CPI 2
– Stores: 12%, CPI 2
– Branches: 24%, CPI 2
• An imperfect cache has a miss rate of 5% for instructions and 10% for
data, and a miss penalty of 40 cycles
• How much faster is the machine with the perfect cache?
• CPI perfect cache = .43 * 1 + .57 * 2 = 1.57
• Because of cache misses, we recompute each instruction’s CPI to account
for misses during instruction fetch (5%) and misses during data
accesses (10%), where each miss adds 40 cycles
• CPI imperfect cache = .43 * (1 + .05 * 40) + .21 * (2 + .05 * 40 + .10 *
40) + .12 * (2 + .05 * 40 + .10 * 40) + .24 * (2 + .05 * 40) = 4.89
• The perfect-cache machine is 4.89 / 1.57 = 3.11 times faster
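The miss-penalty arithmetic above generalizes neatly (a Python sketch of the slide computation):

```python
miss_penalty = 40
i_miss, d_miss = 0.05, 0.10          # instruction / data miss rates

# (fraction of mix, base CPI, data accesses per instruction)
mix = [(0.43, 1, 0),                 # ALU
       (0.21, 2, 1),                 # loads
       (0.12, 2, 1),                 # stores
       (0.24, 2, 0)]                 # branches

cpi_perfect = sum(f * cpi for f, cpi, _ in mix)
cpi_imperfect = sum(f * (cpi + i_miss * miss_penalty
                         + d * d_miss * miss_penalty)
                    for f, cpi, d in mix)
print(round(cpi_perfect, 2), round(cpi_imperfect, 2))   # 1.57 4.89
print(round(cpi_imperfect / cpi_perfect, 2))            # 3.11
```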
Sample Problem #3
• Architects are considering one of two enhancements
for their processor
– #1 can be used 20% of the time and offers a speedup of 3
– #2 offers a speedup of 7
• What fraction of the time will the second enhancement
have to be used in order to achieve the same overall
speedup as the first enhancement?
– speedup from #1 = 1 / [(1 - .2) + .2 / 3] = 1.154
• So, for the second enhancement to match, we have
1.154 = 1 / [(1 – x) + x / 7] and we must solve for x
– using some algebra, we get:
1.154 = 1 / [(1 - x) + x / 7] = 1 / [(7 - 7x + x) / 7] = 7 / (7 - 6x)
so 7 - 6x = 7 / 1.154, giving 6x = 7 - 7 / 1.154 = 0.934,
or x = 0.934 / 6 = 0.156
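The same algebra can be checked numerically (a Python sketch of the problem above):

```python
# Speedup of enhancement #1: used 20% of the time, speedup factor 3
target = 1 / ((1 - 0.20) + 0.20 / 3)

# From 1 / ((1 - x) + x/7) = target, solving gives x = (1 - 1/target) * 7/6
x = (1 - 1 / target) * 7 / 6
print(round(target, 3), round(x, 3))          # 1.154 0.156

# Check: plugging x back into Amdahl's Law reproduces the target speedup
assert abs(1 / ((1 - x) + x / 7) - target) < 1e-9
```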
Sample Problem #4
• We will compare a CISC machine and a RISC
machine on a benchmark
– The machines have the following characteristics
• CISC machine has CPIs of
– 4 for load/store, 3 for ALU/branch, 10 for call/return
– CPU clock rate of 1.75 GHz
• RISC machine has a CPI of 1.2 (as it is pipelined) and a CPU clock
rate of 1 GHz
• the CISC machine uses complex instructions, so the CISC version of the
benchmark has 40% fewer instructions than the RISC version (that is,
CISC IC is 40% less than RISC IC)
– The benchmark has a breakdown of:
• 38% loads, 10% stores, 35% ALU operations, 3% calls, 3% returns,
and 11% branches
– Which machine will run the benchmark in less time?
Solution
• We compare the CPU time for both machines
– CPU time = IC * CPI / Clock rate
• Since both clock rates are expressed in GHz, we can drop the GHz
units to simplify
• CISC machine:
– First, compute the CISC machine’s CPI given the individual CPI for the
machine and the benchmark’s breakdown of instructions:
• CPI = 4 * (.38 + .10) + 3 * (.35 + .11) + 10 * (.03 + .03) = 3.9
– CPU time CISC = IC CISC * 3.9 / 1.75
• RISC machine:
– IC * 1.2 / 1 = IC RISC * 1.2
– Recall that the CISC machine has 40% fewer instructions, so IC CISC = .6 *
IC RISC
• CPU time CISC = .6 * IC RISC * 3.9 / 1.75 = 1.34 IC RISC
• CPU time RISC = 1.2 IC RISC
• Since the RISC CPU time is smaller, the RISC machine is faster, by
1.34 / 1.2 = 1.12, i.e., 12% faster
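The comparison above can be reproduced in Python (a sketch; times are in units of IC RISC / GHz, as in the slide):

```python
# Benchmark instruction mix
mix = {"load": 0.38, "store": 0.10, "alu": 0.35,
       "call": 0.03, "ret": 0.03, "branch": 0.11}

# CISC CPI: 4 for load/store, 3 for ALU/branch, 10 for call/return
cisc_cpi = (4 * (mix["load"] + mix["store"])
            + 3 * (mix["alu"] + mix["branch"])
            + 10 * (mix["call"] + mix["ret"]))
print(round(cisc_cpi, 1))                 # 3.9

cisc_time = 0.6 * cisc_cpi / 1.75         # CISC IC = 60% of RISC IC, 1.75 GHz
risc_time = 1.0 * 1.2 / 1.0               # RISC: CPI 1.2 at 1 GHz
print(round(cisc_time, 2), risc_time)     # 1.34 1.2
print(round(cisc_time / risc_time, 2))    # 1.11 (the slides round to 1.12)
```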
