Advanced Computer Architecture

The document discusses advanced computer architecture and ways to improve computer performance. It covers topics like RISC instruction sets, pipelining, instruction-level parallelism, cache performance optimization, and exploiting parallelism at various levels. It discusses quantitative measures for comparing architectural ideas and computer performance, such as execution time, throughput, and benchmarks. Amdahl's law is introduced as a way to calculate expected speedup from architectural enhancements.


Advanced Computer Architecture

• We will consider issues in current architecture design and implementation:
– RISC instruction sets
– Pipelining
– Instruction-level parallelism
– Block-level parallelism
– Thread-level parallelism
– Multiprocessors
– Improving cache performance
– Optimizing virtual memory usage
• In CSC 362, we focused on
– the roles of the components in the architecture
– the structure of the architecture (how things connect together)
• Here, we focus on
– using available technology to improve computer performance
– using quantitative measures to test architectural ideas
– using a RISC instruction set for examples
– discussing a variety of software and hardware techniques to provide optimization
– attempting to force as much parallelism out of the code as possible
Measuring Performance
• We might use one of the following terms to measure performance
– MIPS, MegaFLOPS
• neither of these terms tells us how the processor performs on the other type of
operation
– Clock speed (GHz rating)
• misleading as we will explore throughout the semester
– Execution time
• worthwhile on an unloaded system
– Throughput
• number of programs / unit time – useful for servers and large systems
– Wall-clock time
– CPU time, user CPU time, system CPU time
• CPU time = user CPU time + system CPU time
– System performance
• on an unloaded system
– note: CPU performance = 1 / execution time
• What does it mean that one computer is faster than another?
Meaning of Performance
• X is n times faster than Y means
– Exec time Y / Exec time X = n
– Perf X / Perf Y = n
• Example:
– if the throughput of X is 1.3 times higher than that of Y
• then X can execute 1.3 times as many tasks as Y in the same amount of time
• Example:
– X executes program p1 in .32 seconds
– Y executes program p1 in .38 seconds
– X is .38 / .32 = 1.19 times faster
• 19% faster
• To validly compare two computers’ performance, we must compare their
performance on the same program
• Additionally, computers may have better performances on
different programs
– e.g., C1 runs P1 faster than C2 but C2 runs P2 faster than C1
– we might use weighted averages or geometric means, as well as
distributions, to derive a single processor’s overall performance (see pages
34-37 if you are interested)
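The geometric mean mentioned above is easy to illustrate. A minimal Python sketch (the machines, programs, and times here are invented for illustration, not from the text):

```python
import math

# Hypothetical execution times (seconds) for two machines on two programs
times = {"C1": {"P1": 1.0, "P2": 1000.0},
         "C2": {"P1": 10.0, "P2": 100.0}}

def geometric_mean(values):
    """Geometric mean: the nth root of the product of n values."""
    return math.prod(values) ** (1.0 / len(values))

# Both machines come out equal under the geometric mean, even though
# each wins decisively on one of the two programs
for machine, progs in times.items():
    print(machine, round(geometric_mean(list(progs.values())), 2))
```

Here each machine is 10 times faster on one program, so the geometric means come out identical; a weighted arithmetic mean would instead reflect how often each program runs.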
Benchmarks
• Four levels of programs can be used to test performance
– Real programs
• e.g., C compiler, CAD tool
• programs with input, output, and options that the user selects
– Kernels
• remove key pieces of programs and test just those
– Toy benchmarks
• 10-100 lines of code, such as quicksort, whose performance is known in advance
– Synthetic benchmarks
• try to match the average frequency of operations to simulate larger programs
• Only real programs are used today
– the others have been discredited, since computer architects and compiler
writers will optimize systems to perform well on these specific
benchmarks/kernels
• A benchmark suite is a set of programs that test different performance metrics
– Example: test array capabilities, floating point operations, loops
– SPEC benchmark suites are commonly cited
– SPEC 96 is the most recent benchmark, see figure 1.13 on page 31
• Reporting benchmark results must include
– compiler settings and version
– input
– OS
– number/size of disks
• Results must be reproducible
Principles of Computer Design
• As computer architecture research has progressed,
several key design concepts have been identified
– The goal today is to further exploit each of these because they
provide a great deal of performance speedup
– We will examine these and use a quantitative approach to
identify the extent of the speedup
• Take advantage of parallelism
– Using multiple hardware components (ALU functional units, memory
modules, register ports, disk drives, etc.), we can attempt to execute
instructions and threads in parallel
• Principle of locality of reference
– Used to design memory systems so that we can attempt to keep in cache
the data and instructions that will most likely be referenced soon
• Focus on the common case
– As we see next, if we can achieve a small speedup for executing the
common case, it is better than achieving a large speedup for an
uncommon case
Amdahl’s Law
• In order to explore architectural improvements, we need a mechanism to
gauge the speedup of our improvements
• Amdahl’s Law allows us to compute the speedup that can be gained by
using a particular feature as follows
• Given an enhancement E
– Speedup = performance with E / performance without E
or
– Speedup = execution time without E / execution time with E
• This law uses two factors:
– Fraction of the computation time in the original machine that can be
converted to take advantage of the enhancement (F)
– Improvement gained by the enhanced execution mode, i.e., how much
faster the task runs if the enhanced mode is used for the entire
program (S)

Speedup = 1 / [(1 - F) + F / S]
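The formula above is simple enough to check in code. A minimal Python sketch (the function name is ours, not from the text):

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Amdahl's Law: overall speedup = 1 / ((1 - F) + F / S)."""
    return 1.0 / ((1.0 - fraction_enhanced)
                  + fraction_enhanced / speedup_enhanced)

# Sanity checks: F = 0 gives no speedup; F = 1 gives the full factor S
print(amdahl_speedup(0.0, 10))   # 1.0
print(amdahl_speedup(1.0, 10))   # 10.0
```

Note that the result is bounded above by 1 / (1 - F) no matter how large S grows, which is why enhancing the common case matters.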
Examples
• Example 1:
– Web server is to be enhanced
• new CPU is 10 times faster on computation than old CPU
• the original CPU spent 40% of its time processing and 60% of its time waiting
for I/O
– What will the speedup be?
• Fraction enhancement used = 40%
• Speedup in enhanced mode = 10
• Speedup = 1 / [(1 - .4) + .4/10] = 1.56
• Example 2:
– A benchmark consists of:
• 20% FP sqrt
• 50% FP operations (including sqrt)
• 50% other operations
– Enhancement options are:
• add FP sqrt hardware to speed up sqrt performance by a factor of 10
• enhance all FP operations by a factor of 1.6
– Speedup FP sqrt = 1/[(1-.2) + .2/10] = 1.22
– Speedup all FP = 1/[(1-.5) + .5/1.6] = 1.23
– The enhancement to support the common case is (slightly) better
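Both examples above can be verified with a few lines of Python (a sketch reproducing the slide arithmetic):

```python
def amdahl(f, s):
    # Amdahl's Law: speedup = 1 / ((1 - f) + f / s)
    return 1.0 / ((1.0 - f) + f / s)

# Example 1: web server, enhancement used 40% of the time, 10x faster CPU
print(round(amdahl(0.40, 10), 2))    # 1.56

# Example 2: FP sqrt hardware (20% of time, 10x) vs. all FP ops (50%, 1.6x)
print(round(amdahl(0.20, 10), 2))    # 1.22
print(round(amdahl(0.50, 1.6), 2))   # 1.23
```

The modest 1.6x enhancement applied to half the time edges out the 10x enhancement applied to a fifth of it.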
CPU Performance Formulae
• CPU time = CPU clock cycles * clock cycle time
– CPU clock cycles – the number of clock cycles that elapse during the
execution of the given program
– clock cycle time is the reciprocal of the clock rate – that is, how much
time elapses for one clock cycle, which gives us:
• CPU time = CPU clock cycles for prog / clock rate
• CPU time = IC * CPI * Clock cycle time
– IC - instruction count (number of instructions)
– CPI - clock cycles per instruction
– IC * CPI = CPU clock cycles
• CPI = CPU clock cycles / IC
• CPU time = ( CPIi * ICi) * clock cycle time
• Average CPI = (CPIi * ICi) / Total Instruction Count
– In the latter equation, CPIi and ICi are for each type of operation (for
instance, the CPI and number of adds, the CPI and number of loads, …)
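The weighted-CPI formulae above can be sketched directly in Python (the instruction mix and clock cycle time here are assumed values, not from the text):

```python
# mix maps instruction class -> (instruction count, CPI for that class)
mix = {"load":   (2_000_000, 2),
       "store":  (1_000_000, 2),
       "alu":    (5_000_000, 1),
       "branch": (2_000_000, 2)}
clock_cycle_time = 1e-9  # assumed 1 ns cycle (1 GHz clock)

total_cycles = sum(ic * cpi for ic, cpi in mix.values())  # sum of CPIi * ICi
total_ic = sum(ic for ic, _ in mix.values())
avg_cpi = total_cycles / total_ic
cpu_time = total_cycles * clock_cycle_time

print(avg_cpi)    # 1.5
print(cpu_time)   # 0.015 seconds
```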
Example
• Assume:
– frequency of FP operations (including sqrt) = 25% and
frequency of FP sqrt = 2%
– average CPI of FP operations = 4.0, CPI of FP sqrt = 20
– average CPI of other instructions = 1.33
– CPI original = 4 * 25% + 1.33 * 75% = 2.0
• Two alternatives:
– reduce CPI of FP sqrt to 2 or
– reduce average CPI of all FP ops (including sqrt) to 2.5
• CPI new FP sqrt = CPI original - 2% * (20 - 2) = 2.0 - 0.36 = 1.64
• CPI new FP = 75% * 1.33 + 25% * 2.5 = 1.625
– Speedup new FP sqrt = CPI original / CPI new FP sqrt = 2.0 / 1.64 = 1.22
– Speedup new FP = CPI original / CPI new FP = 2.0 / 1.625 = 1.23
– again, enhancing the more common case is (slightly) better
Computing Speedup – which formula?
• We can compute speedup by
– determining the difference in CPU time before and after an enhancement
– or by using Amdahl’s Law
• Which should we use?
– the two formulas give the same answer
– let’s demonstrate this with an example:
• Benchmark consists of 35% loads, 15% stores, 40% ALU
operations and 10% branches
– CPI for each instruction is 5 for loads and stores and 4 for ALU and
branches (since this is an integer benchmark, the floating point registers
are not used)
– Consider that we could keep more values in registers by moving them to
floating point registers rather than storing and then reloading these values
in memory
• Let’s have the compiler replace some of the loads/stores with
register moves
– this enhancement is done by the compiler, so costs us nothing!
– assuming that the compiler can reduce 20% of the loads from the
program, how worthwhile is it?
Solution
• We change some loads/stores to ALU operations
– so overall CPI goes down, IC remains the same
• Solution 1: compute CPU Time differences
– CPU Time = IC * CPI * clock cycle time
– CPIold = 50% * 5 + 50% * 4 = 4.5
– CPInew = 40% * 5 + 60% * 4 = 4.4
– Since IC and clock cycle time have not changed, speedup is just CPIold /
CPInew
– Speedup = 4.5 / 4.4 = 1.0227, a 2.27% speedup
• Solution 2: Amdahl’s Law
– Speedup of enhanced mode is from 5 cycles to 4 cycles or 5/4 = 1.25
– Fraction used = fraction of the execution time where we use conversions
instead of loads/stores
• overall CPI is 4.5
• enhancement used on 20% of loads/stores
• 20% * 50% * 5 = .5 clock cycles out of 4.5, or .5 / 4.5 = 11.1% of the time
– Amdahl’s Law = 1 / [1 – F + F / S] = 1 / [1 - .111 + .111 / 1.25] =
1 / .9778 = 1.0227 = 2.27% speedup
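Both routes to the answer can be reproduced in a few lines of Python (a sketch of the slide arithmetic):

```python
# Method 1: CPI ratio (IC and clock cycle time are unchanged)
cpi_old = 0.50 * 5 + 0.50 * 4           # loads/stores at CPI 5, ALU/branch at 4
cpi_new = 0.40 * 5 + 0.60 * 4           # 20% of loads/stores became ALU ops
print(round(cpi_old / cpi_new, 4))      # 1.0227

# Method 2: Amdahl's Law
f = (0.20 * 0.50 * 5) / cpi_old         # fraction of time affected ~= 0.111
s = 5 / 4                               # enhanced mode: 5 cycles -> 4 cycles
print(round(1 / ((1 - f) + f / s), 4))  # 1.0227
```

Both methods agree because they are algebraic rearrangements of the same execution-time ratio.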
Why MIPS Can Be Misleading
• Assume a load-store machine with a breakdown of
– 43% ALU operations, CPI = 1
– 21% loads, 12% stores, 24% branches, CPI = 2 for each (57% of
instructions in total)
– an optimizing compiler is able to discard 50% of the ALU operations
• Ignoring system issues, the machine has a 2 nanosecond clock cycle
(500 MHz) and an unoptimized CPI of 1.57
– what is the MIPS rating for the optimized and unoptimized versions?
does the MIPS value agree with the execution time?
• MIPS = IC / (Execution Time * 10^6)
– exec time = IC * CPI / clock rate
– so, MIPS = clock rate / (CPI * 10^6)
• CPI unoptimized = .43 * 1 + .57 * 2 = 1.57
• MIPS unoptimized = (500 * 10^6) / (1.57 * 10^6) = 318.5
• CPI optimized = (.43 / 2 * 1 + .57 * 2) / (1 - .43 / 2) = 1.73
• MIPS optimized = (500 * 10^6) / (1.73 * 10^6) = 289.0
– The optimized program will execute faster because it has fewer
instructions, but its CPI is larger because a greater portion of the
remaining instructions have a higher CPI, and therefore its MIPS rating
is lower
• So, MIPS and execution time are not directly related!
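The MIPS-versus-time mismatch above is easy to reproduce (a Python sketch of the slide arithmetic; the instruction count is an assumed value):

```python
clock_rate = 500e6                      # 500 MHz

cpi_unopt = 0.43 * 1 + 0.57 * 2         # 1.57
cpi_opt = (0.43 / 2 * 1 + 0.57 * 2) / (1 - 0.43 / 2)   # ~1.73

mips_unopt = clock_rate / (cpi_unopt * 1e6)
mips_opt = clock_rate / (cpi_opt * 1e6)
print(round(mips_unopt, 1), round(mips_opt, 1))
# 318.5 289.7 (the slides get 289.0 from the pre-rounded CPI of 1.73)

# Execution time: the optimized program runs FASTER despite lower MIPS,
# because its instruction count shrank to 78.5% of the original
ic = 1e6                                # assumed original instruction count
t_unopt = ic * cpi_unopt / clock_rate
t_opt = (1 - 0.43 / 2) * ic * cpi_opt / clock_rate
print(t_opt < t_unopt)                  # True
```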
Sample Problem #1
• Consider adding register-memory ALU instructions to a machine
that previously only permitted register-register ALU operations
• Assume a benchmark with the following breakdown of
operations is used to test this enhancement:
– ALU operations: 43%, CPI = 1
– Loads: 21%, CPI = 2
– Stores: 12%, CPI = 2
– Branches: 24%, CPI = 2
• The new ALU register-memory operation has the following
consequences:
– ALU register-memory operations have CPI = 2 and Branches now
have a CPI = 3
• But, 25% of data loaded are only used once so that the new ALU
register-memory instruction can be used in place of the load +
ALU operation
• Is it worth it?
Solution
• CPIold = .43 * 1 + .57 * 2 = 1.57
• 3 changes:
– some ALU operations use the new mode, which changes their CPI
– fewer loads
– all branches have a higher CPI
• We have a new distribution:
– 25% of the ALU operations become ALU-memory operations
• 25% * 43% = 11%, so we remove this many loads, giving us 89% as
many instructions as previously
– Loads: [21% - (25% * 43%)] / 89% = 11%
– Stores: 12% / 89% = 13%
– ALU operations: 43% / 89% = 48%
– Branches: 24% / 89% = 27%
• CPInew = .11 * 2 + .13 * 2 + .27 * 3 + .48 * (.25 * 2 + .75 * 1) = 1.89
• CPU Time = IC * CPI * clock cycle time
– clock cycle time remains unchanged
– CPI has been recomputed
– IC in the new system is 89% of the old system
– CPUold = IC * 1.57 * CCT
– CPUnew = .89 * IC * 1.89 * CCT
– Speedup = 1.57 / (.89 * 1.89) = .934
• this is a slowdown, so this enhancement is not an improvement!
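The solution above can be reproduced in Python (a sketch using the slides' rounded distribution):

```python
cpi_old = 0.43 * 1 + 0.57 * 2                      # 1.57

# New mix after 25% of ALU ops absorb their load (IC shrinks to 89%)
ic_ratio = 0.89
loads, stores, alu, branch = 0.11, 0.13, 0.48, 0.27
cpi_new = (loads * 2 + stores * 2 + branch * 3
           + alu * (0.25 * 2 + 0.75 * 1))          # reg-mem CPI 2, reg-reg CPI 1
print(round(cpi_new, 2))                           # 1.89

speedup = cpi_old / (ic_ratio * cpi_new)
print(round(speedup, 3))                           # 0.933, a slowdown
```

Carrying the unrounded percentages through gives a slightly different value (about 0.92), but the conclusion is the same: the enhancement is a net loss.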
Sample Problem #2
• Assume a machine with a perfect cache and the following instruction mix:
– ALU: 43%, CPI 1
– Loads: 21%, CPI 2
– Stores: 12%, CPI 2
– Branches: 24%, CPI 2
• An imperfect cache has a miss rate of 5% for instructions and 10% for
data, and a miss penalty of 40 cycles
• How much faster is the machine with the perfect cache?
• CPI perfect cache = .43 * 1 + .57 * 2 = 1.57
• Because of cache misses, we recompute each instruction’s CPI to account
for misses during instruction fetch (5%) and misses during data
accesses (10%), where each miss adds 40 cycles
• CPI imperfect cache = .43 * (1 + .05 * 40) + .21 * (2 + .05 * 40 + .10 *
40) + .12 * (2 + .05 * 40 + .10 * 40) + .24 * (2 + .05 * 40) = 4.89
• The perfect-cache machine is 4.89 / 1.57 = 3.11 times faster
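The miss-penalty arithmetic above generalizes neatly (a Python sketch of the slide computation):

```python
miss_penalty = 40
i_miss, d_miss = 0.05, 0.10          # instruction / data miss rates

# (fraction of mix, base CPI, data accesses per instruction)
mix = [(0.43, 1, 0),                 # ALU
       (0.21, 2, 1),                 # loads
       (0.12, 2, 1),                 # stores
       (0.24, 2, 0)]                 # branches

cpi_perfect = sum(f * cpi for f, cpi, _ in mix)
cpi_imperfect = sum(f * (cpi + i_miss * miss_penalty
                         + d * d_miss * miss_penalty)
                    for f, cpi, d in mix)
print(round(cpi_perfect, 2), round(cpi_imperfect, 2))   # 1.57 4.89
print(round(cpi_imperfect / cpi_perfect, 2))            # 3.11
```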
Sample Problem #3
• Architects are considering one of two enhancements
for their processor
– #1 can be used 20% of the time and offers a speedup of 3
– #2 offers a speedup of 7
• What fraction of the time will the second enhancement
have to be used in order to achieve the same overall
speedup as the first enhancement?
– speedup from #1 = 1 / [(1 - .2) + .2 / 3] = 1.154
• So, for the second enhancement to match, we have
1.154 = 1 / [(1 – x) + x / 7] and we must solve for x
– using some algebra, we get:
1.154 = 1 / [(1 - x) + x / 7] = 1 / [(7 - 7x + x) / 7] = 7 / (7 - 6x)
so 7 - 6x = 7 / 1.154, giving 6x = 7 - 7 / 1.154 = 0.934,
or x = 0.934 / 6 = 0.156
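The same algebra can be checked numerically (a Python sketch of the problem above):

```python
# Speedup of enhancement #1: used 20% of the time, speedup factor 3
target = 1 / ((1 - 0.20) + 0.20 / 3)

# From 1 / ((1 - x) + x/7) = target, solving gives x = (1 - 1/target) * 7/6
x = (1 - 1 / target) * 7 / 6
print(round(target, 3), round(x, 3))          # 1.154 0.156

# Check: plugging x back into Amdahl's Law reproduces the target speedup
assert abs(1 / ((1 - x) + x / 7) - target) < 1e-9
```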
Sample Problem #4
• We will compare a CISC machine and a RISC
machine on a benchmark
– The machines have the following characteristics
• CISC machine has CPIs of
– 4 for load/store, 3 for ALU/branch, 10 for call/return
– CPU clock rate of 1.75 GHz
• RISC machine has a CPI of 1.2 (as it is pipelined) and a CPU clock
rate of 1 GHz
• the CISC machine uses complex instructions, so the CISC version of the
benchmark has 40% fewer instructions than the RISC version (that is,
CISC IC is 40% less than RISC IC)
– The benchmark has a breakdown of:
• 38% loads, 10% stores, 35% ALU operations, 3% calls, 3% returns,
and 11% branches
– Which machine will run the benchmark in less time?
Solution
• We compare the CPU time for both machines
– CPU time = IC * CPI / Clock rate
• Since both clock rates are expressed in GHz, we can drop the GHz
units to simplify
• CISC machine:
– First, compute the CISC machine’s CPI given the individual CPI for the
machine and the benchmark’s breakdown of instructions:
• CPI = 4 * (.38 + .10) + 3 * (.35 + .11) + 10 * (.03 + .03) = 3.9
– CPU time CISC = IC CISC * 3.9 / 1.75
• RISC machine:
– IC * 1.2 / 1 = IC RISC * 1.2
– Recall that the CISC machine has 40% fewer instructions, so IC CISC = .6 *
IC RISC
• CPU time CISC = .6 * IC RISC * 3.9 / 1.75 = 1.34 IC RISC
• CPU time RISC = 1.2 IC RISC
• Since the RISC CPU time is smaller, the RISC machine is faster, by
1.34 / 1.2 = 1.12, i.e., 12% faster
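The comparison above can be reproduced in Python (a sketch; times are in units of IC RISC / GHz, as in the slide):

```python
# Benchmark instruction mix
mix = {"load": 0.38, "store": 0.10, "alu": 0.35,
       "call": 0.03, "ret": 0.03, "branch": 0.11}

# CISC CPI: 4 for load/store, 3 for ALU/branch, 10 for call/return
cisc_cpi = (4 * (mix["load"] + mix["store"])
            + 3 * (mix["alu"] + mix["branch"])
            + 10 * (mix["call"] + mix["ret"]))
print(round(cisc_cpi, 1))                 # 3.9

cisc_time = 0.6 * cisc_cpi / 1.75         # CISC IC = 60% of RISC IC, 1.75 GHz
risc_time = 1.0 * 1.2 / 1.0               # RISC: CPI 1.2 at 1 GHz
print(round(cisc_time, 2), risc_time)     # 1.34 1.2
print(round(cisc_time / risc_time, 2))    # 1.11 (the slides round to 1.12)
```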
