Computer Architecture Revision For Final Exam


Revision Slides for Computer Architecture

- Maaruf Ali

Computer Architecture Slide Deck 5: Superscalar 2 and Exceptions

Causes of Exceptions
Interrupt: an event that requests the attention of the processor

• Asynchronous: an external event
  – input/output device service request
  – timer expiration
  – power disruptions, hardware failure
• Synchronous: an internal exception (a.k.a. exception/trap)
  – undefined opcode, privileged instruction
  – arithmetic overflow, FPU exception
  – misaligned memory access
  – virtual memory exceptions: page faults, TLB misses, protection violations
  – software exceptions, e.g., system calls
Asynchronous Interrupts: Invoking the Interrupt Handler

• An I/O device requests attention by asserting one of the prioritized
  interrupt request lines

• When the processor decides to process the interrupt:
  – It stops the current program at instruction Ii, completing all the
    instructions up to Ii-1 (a precise interrupt)
  – It saves the PC of instruction Ii in a special register (EPC)
  – It disables interrupts and transfers control to a designated interrupt
    handler running in kernel mode
Interrupt Handler
• Saves EPC before re-enabling interrupts to allow nested interrupts
  – need an instruction to move EPC into the GPRs
  – need a way to mask further interrupts at least until EPC can be saved
• Needs to read a status register that indicates the cause of the interrupt
• Uses a special indirect jump instruction RFE (return-from-exception) to
  resume user code; this:
  – enables interrupts
  – restores the processor to user mode
  – restores hardware status and control state
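
A minimal C sketch of this sequence, simulating the architectural state rather than using any real ISA (the struct fields, cause codes and handler address below are assumptions for illustration only):

```c
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical machine state, for illustration only. */
typedef struct {
    uint32_t pc, epc, cause;     /* program counter, exception PC, cause register */
    bool     interrupts_enabled;
    bool     kernel_mode;
} Cpu;

/* Done by "hardware" on an asynchronous interrupt: save PC into EPC,
 * mask interrupts, enter kernel mode, jump to the handler. */
static void take_interrupt(Cpu *cpu, uint32_t cause, uint32_t handler_pc) {
    cpu->epc = cpu->pc;                 /* PC of interrupted instruction Ii */
    cpu->cause = cause;
    cpu->interrupts_enabled = false;    /* mask further interrupts */
    cpu->kernel_mode = true;
    cpu->pc = handler_pc;
}

/* The handler: save EPC away before re-enabling interrupts (so a nested
 * interrupt cannot clobber it), read the cause, then "RFE" back. */
static void interrupt_handler(Cpu *cpu) {
    uint32_t saved_epc = cpu->epc;      /* "move EPC into a GPR" */
    cpu->interrupts_enabled = true;     /* nested interrupts are now safe */
    printf("handling cause %u, will return to 0x%x\n", cpu->cause, saved_epc);

    /* RFE: restore user mode and resume the interrupted program. */
    cpu->kernel_mode = false;
    cpu->pc = saved_epc;
}

int main(void) {
    Cpu cpu = { .pc = 0x100, .interrupts_enabled = true, .kernel_mode = false };
    take_interrupt(&cpu, 7 /* e.g. timer */, 0x8000);
    interrupt_handler(&cpu);
    printf("resumed at 0x%x, kernel_mode=%d\n", cpu.pc, cpu.kernel_mode);
    return 0;
}
```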
Synchronous Interrupts
• A synchronous interrupt (exception) is caused by a particular instruction

• In general, the instruction cannot be completed and needs to be restarted
  after the exception has been handled
  – requires undoing the effect of one or more partially executed instructions

• In the case of a system call trap, the instruction is considered to have
  been completed
  – syscall is a special jump instruction involving a change to privileged
    kernel mode
  – the handler resumes at the instruction after the system call
Exception Handling in the 5-Stage Pipeline

[Figure: 5-stage pipeline (PC/Fetch, Decode, Execute, Memory, Writeback) with
the exceptions that can arise in each stage: PC address exception in Fetch,
illegal opcode in Decode, arithmetic overflow in Execute, data address
exceptions in Memory, plus asynchronous interrupts arriving from outside.]

• How to handle multiple simultaneous exceptions in different pipeline stages?
• How and where to handle external asynchronous interrupts?
Exception Handling in the 5-Stage Pipeline (mechanism)

[Figure: the same 5-stage pipeline with exception-handling hardware added.
Each stage latches its exception flag and PC (Exc D/E/M, PC D/E/M) and carries
them down the pipeline to a commit point at the Memory stage. At commit, the
oldest exception is selected: its PC is written into EPC, its cause into the
Cause register, the handler PC is selected as the next fetch address, and the
instructions in the Fetch, Decode and Execute stages are killed, along with
the writeback of the faulting instruction. Asynchronous interrupts are
injected at the commit point.]
Exception Pipeline Diagram

                  time   t0   t1   t2   t3   t4   t5   t6   t7  ...
(I1) 096: ADD            IF1  ID1  EX1  MA1  nop                       <- overflow!
(I2) 100: XOR                 IF2  ID2  EX2  nop  nop
(I3) 104: SUB                      IF3  ID3  nop  nop  nop
(I4) 108: ADD                           IF4  nop  nop  nop  nop
(I5) Exc. handler code                       IF5  ID5  EX5  MA5  WB5

Resource usage:
                  time   t0   t1   t2   t3   t4   t5   t6   t7  ...
                  IF     I1   I2   I3   I4   nop  I5
                  ID          I1   I2   I3   nop  nop  I5
                  EX               I1   I2   nop  nop  nop  I5
                  MA                    I1   nop  nop  nop  nop  I5
                  WB
Out-Of-Order (OOO) Introduction

Name   Frontend  Issue  Writeback  Commit  Structures needed
I4     IO        IO     IO         IO      Fixed-length pipeline, Scoreboard
I2O2   IO        IO     OOO        OOO     Scoreboard
I2OI   IO        IO     OOO        IO      Scoreboard, Reorder Buffer, and Store Buffer
IO3    IO        OOO    OOO        OOO     Scoreboard and Issue Queue
IO2I   IO        OOO    OOO        IO      Scoreboard, Issue Queue, Reorder Buffer, and Store Buffer

 Frontend = Instruction Fetch and Decode stages
 IO = In Order
 OOO = Out Of Order
 Issue = all the operands are ready for execution
 Writeback = write to memory/register
 Commit = the instruction can no longer be rolled back and must be completed
 IO2I = In-order frontend, Out-of-order issue, Out-of-order writeback, In-order commit
 Scoreboard – a structure that tracks which instructions are ready to execute
 Reorder Buffer – when instructions execute out of order, this is where they
  are re-ordered so that they commit in order; dependences are resolved so
  that we never commit out of order
OOO Motivating Code Sequence

0 MUL   R1, R2, R3
1 ADDIU R11,R10,1
2 MUL   R5, R1, R4
3 MUL   R7, R5, R6
4 ADDIU R12,R11,1
5 ADDIU R13,R12,1
6 ADDIU R14,R12,2

[Dependence graph: 0 -> 2 -> 3 (the MUL chain) and 1 -> 4 -> 5, 4 -> 6 (the
ADDIU chain); the two chains are independent of each other.]

 Two independent sequences of instructions enable flexibility in how the
  instructions are scheduled in total order
 We can schedule statically in software or dynamically in hardware
 Instruction-level parallelism: multiple instructions can be executed in the
  same cycle when there is no data dependence between them
Computer Architecture
ELE 475 / COS 475
Slide Deck 6: Superscalar 3
Register Renaming
• Adding more “names” (registers/memory) removes dependences, but the
  architectural namespace is limited.
  – Registers: a larger namespace requires more bits in the instruction
    encoding. 32 registers = 5 bits, 128 registers = 7 bits.

• Register Renaming: change the naming of registers in hardware to eliminate
  WAW and WAR hazards
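
A minimal C sketch of the idea (the map table and simple free-register counter are illustrative assumptions, not a real rename design): each destination gets a fresh physical register, so a later write to the same architectural register no longer conflicts with earlier readers or writers.

```c
#include <stdio.h>

#define NUM_ARCH_REGS 32

/* Rename table: maps each architectural register to the physical register
 * holding its latest value. Naive allocation, no free-list recycling. */
static int map_table[NUM_ARCH_REGS];
static int next_free = NUM_ARCH_REGS;

static int rename_src(int arch) { return map_table[arch]; }
static int rename_dst(int arch) { map_table[arch] = next_free++; return map_table[arch]; }

int main(void) {
    for (int r = 0; r < NUM_ARCH_REGS; r++) map_table[r] = r;

    /* MUL R1, R2, R3 followed by MUL R1, R4, R5: a WAW hazard on R1. */
    int s1 = rename_src(2), s2 = rename_src(3), d1 = rename_dst(1);
    int s3 = rename_src(4), s4 = rename_src(5), d2 = rename_dst(1);

    printf("MUL P%d, P%d, P%d\n", d1, s1, s2);   /* writes P32 */
    printf("MUL P%d, P%d, P%d\n", d2, s3, s4);   /* writes P33: no WAW hazard */
    return 0;
}
```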
In-Order Memory Queue
• Execute all loads and stores in program order
  => a load or store cannot leave the issue queue for execution until all
     previous loads and stores have completed execution

• Can still execute loads and stores speculatively, and out-of-order with
  respect to other (non-memory) instructions

• Need a structure to handle memory ordering…
Address Speculation
  ST R1, 0(R2)
  LD R3, 0(R4)

• Guess that R4 != R2
• Execute the load before the store address is known
• Need to hold all completed but uncommitted load/store addresses in program
  order
• If we subsequently find R4 == R2, squash the load and all following
  instructions
  => large penalty for inaccurate address speculation
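
A small C sketch of the squash check under assumed data structures (a list of uncommitted memory operations kept in program order): when a store's address resolves, any younger load that already executed to the same address got stale data and must be squashed, along with everything after it.

```c
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

#define MAXOPS 16

/* One entry per in-flight memory operation, held in program order. */
typedef struct {
    bool     is_load;
    bool     executed;   /* load already performed speculatively */
    uint32_t addr;       /* resolved address */
} MemOp;

/* When store `s` resolves its address, return the index of the first younger
 * executed load to the same address (point from which to squash), or -1. */
static int find_squash_point(const MemOp *q, int nops, int s, uint32_t store_addr) {
    for (int i = s + 1; i < nops; i++)
        if (q[i].is_load && q[i].executed && q[i].addr == store_addr)
            return i;
    return -1;
}

int main(void) {
    MemOp q[MAXOPS] = {
        { .is_load = false, .executed = false, .addr = 0 },      /* ST R1, 0(R2) */
        { .is_load = true,  .executed = true,  .addr = 0x1000 }, /* LD R3, 0(R4) ran early */
    };
    int squash = find_squash_point(q, 2, 0, 0x1000);  /* store resolves to 0x1000 */
    if (squash >= 0) printf("squash from op %d onward\n", squash);
    else             printf("speculation was correct\n");
    return 0;
}
```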
Memory Dependence Prediction (Alpha 21264)
  ST R1, (R2)
  LD R3, (R4)

• Guess that R4 != R2 and execute the load before the store
• If we later find R4 == R2, squash the load and all following instructions,
  but mark the load instruction as store-wait
• Subsequent executions of the same load instruction will wait for all
  previous stores to complete
• Periodically clear the store-wait bits
Computer Architecture
ELE 475 / COS 475
Slide Deck 7: VLIW
David Wentzlaff
Department of Electrical Engineering
Princeton University
VLIW: Very Long Instruction Word

[Instruction format: | Int Op 1 | Int Op 2 | Mem Op 1 | Mem Op 2 | FP Op 1 | FP Op 2 |
feeding two integer units (single-cycle latency), two load/store units
(three-cycle latency) and two floating-point units (four-cycle latency).]

• Multiple operations packed into one instruction
• Each operation slot is for a fixed function
• Constant operation latencies are specified
• Architecture requires guarantee of:
  – parallelism within an instruction => no cross-operation RAW check
  – no data use before data ready => no data interlocks
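
A toy C sketch of the bundle idea (slot layout and opcodes invented for illustration): every slot is a fixed-function operation, and the whole bundle issues together with no cross-slot hazard checks, because the compiler is assumed to have guaranteed independence.

```c
#include <stdio.h>
#include <stdint.h>

/* Hypothetical 6-slot bundle: slots 0-1 integer, 2-3 memory, 4-5 floating point. */
typedef struct { uint8_t op, dst, src1, src2; } Slot;
typedef struct { Slot slot[6]; } Bundle;

enum { NOP = 0, ADD = 1 };

static int32_t regs[32];

/* Issue one bundle: all slots read their sources first, then all write.
 * No interlocks -- correctness relies entirely on the compiler's schedule. */
static void issue(const Bundle *b) {
    int32_t result[6];
    for (int i = 0; i < 6; i++)                                   /* read phase  */
        result[i] = (b->slot[i].op == ADD)
                  ? regs[b->slot[i].src1] + regs[b->slot[i].src2]
                  : 0;
    for (int i = 0; i < 6; i++)                                   /* write phase */
        if (b->slot[i].op != NOP) regs[b->slot[i].dst] = result[i];
}

int main(void) {
    regs[2] = 5; regs[3] = 7;
    /* Two independent integer ADDs in one bundle; remaining slots are NOPs. */
    Bundle b = { .slot = { {ADD, 1, 2, 3}, {ADD, 4, 2, 2} } };
    issue(&b);
    printf("R1=%d R4=%d\n", regs[1], regs[4]);   /* prints R1=12 R4=10 */
    return 0;
}
```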
Software Pipelining vs. Loop Unrolling

[Figure: performance over time for one loop. The unrolled loop pays a startup
overhead and a wind-down overhead within every iteration of the unrolled body;
the software-pipelined loop pays those costs only once for the whole loop.]

Software pipelining pays startup/wind-down costs only once per loop,
not once per iteration
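
A hedged C illustration of the two transformations on a simple a[i] = b[i] * c loop (hand-written sketch, not compiler output): unrolling replicates the body to expose more work per iteration, while the software-pipelined version overlaps the load of iteration i+1 with the compute of iteration i, with an explicit prologue and epilogue.

```c
#include <stdio.h>

#define N 8

/* Unrolled by 4: more independent work per iteration, but the replicated body
 * still starts and finishes its "pipeline" inside every iteration. */
static void unrolled(int *a, const int *b, int c) {
    for (int i = 0; i < N; i += 4) {
        a[i]   = b[i]   * c;
        a[i+1] = b[i+1] * c;
        a[i+2] = b[i+2] * c;
        a[i+3] = b[i+3] * c;
    }
}

/* Software-pipelined (one stage deep): the load for iteration i+1 is issued
 * while the multiply/store of iteration i completes. The prologue fills the
 * pipeline once; the epilogue drains it once. */
static void sw_pipelined(int *a, const int *b, int c) {
    int loaded = b[0];                 /* prologue: first load */
    for (int i = 0; i < N - 1; i++) {
        int next = b[i + 1];           /* load for the next iteration */
        a[i] = loaded * c;             /* compute/store for this iteration */
        loaded = next;
    }
    a[N - 1] = loaded * c;             /* epilogue: drain the last value */
}

int main(void) {
    int b[N] = {1, 2, 3, 4, 5, 6, 7, 8}, a1[N], a2[N];
    unrolled(a1, b, 3);
    sw_pipelined(a2, b, 3);
    for (int i = 0; i < N; i++) printf("%d %d\n", a1[i], a2[i]);   /* identical results */
    return 0;
}
```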
Trace Scheduling [Fisher, Ellis]
• Pick a string of basic blocks, a trace, that represents the most frequent
  branch path
• Use profiling feedback or compiler heuristics to find common branch paths
• Schedule the whole “trace” at once
• Add fixup code to cope with branches jumping out of the trace
Problems with “Classic” VLIW
• Object-code compatibility
  – have to recompile all code for every machine, even for two machines in the
    same generation
• Object code size
  – instruction padding wastes instruction memory/cache
  – loop unrolling/software pipelining replicates code
• Scheduling variable-latency memory operations
  – caches and/or memory bank conflicts impose statically unpredictable
    variability
• Knowing branch probabilities
  – profiling requires a significant extra step in the build process
• Scheduling for statically unpredictable branches
  – the optimal schedule varies with the branch path
• Precise interrupts can be challenging
  – does a fault in one portion of a bundle fault the whole bundle?
  – the EQ model has problems with single-stepping, etc.
Code Motion

Before Code Motion          After Code Motion
MUL   R1, R2, R3            LW    R14, 0(R9)
ADDIU R11,R10,1             ADDIU R11,R10,1
MUL   R5, R1, R4            MUL   R1, R2, R3
MUL   R7, R5, R6            ADDIU R12,R11,1
SW    R7, 0(R16)            MUL   R5, R1, R4
ADDIU R12,R11,1             ADD   R13,R12,R14
LW    R14, 0(R9)            MUL   R7, R5, R6
ADD   R13,R12,R14           ADD   R14,R12,R13
ADD   R14,R12,R13           SW    R7, 0(R16)
BNEQ  R16, target           BNEQ  R16, target
Scheduling and Bundling

Before Bundling             After Bundling
LW    R14, 0(R9)            {LW    R14, 0(R9)
ADDIU R11,R10,1              ADDIU R11,R10,1
MUL   R1, R2, R3             MUL   R1, R2, R3}
ADDIU R12,R11,1             {ADDIU R12,R11,1
MUL   R5, R1, R4             MUL   R5, R1, R4}
ADD   R13,R12,R14           {ADD   R13,R12,R14
MUL   R7, R5, R6             MUL   R7, R5, R6}
ADD   R14,R12,R13           {ADD   R14,R12,R13
SW    R7, 0(R16)             SW    R7, 0(R16)
BNEQ  R16, target            BNEQ  R16, target}
Prologue
• In computer architecture, the terms “prologue” and “epilogue” refer to
  specific sections of code within a function.
1. The function prologue is a set of instructions that appears at the
   beginning of a function.
2. Its purpose is to prepare the stack and registers for use within the
   function.
3. Key actions performed by the prologue include:
   – saving any registers that the function might use (registers that the
     platform’s calling convention requires to be preserved across function
     calls);
   – setting up the stack to allocate space for local variables;
   – establishing a base pointer (or frame pointer) as a stable reference
     point into the current stack frame.
4. The prologue ensures that the function has a clean slate to work with.
Epilogue
 The function epilogue appears at the end of a function.
 Its purpose is to restore the stack and registers to the state they were in
  before the function was called.
 Key actions performed by the epilogue include:
  – dropping the stack pointer back to the current base pointer, freeing the
    room reserved for local variables;
  – popping the base pointer off the stack, restoring it to its value before
    the prologue;
  – returning control to the calling function by popping the previous frame’s
    program counter off the stack and jumping to it.
 Essentially, the epilogue cleans up after the function execution.
The Need for Prologue and Epilogue
 These prologue and epilogue sections are essential for managing the
  function’s context and ensuring proper execution within the broader program.
 They are conventions used by assembly-language programmers and by compilers
  of higher-level languages.
Register Rotation
 In this context, register rotation refers to circularly shifting the bits of
  a register around its two ends, without any loss of data or contents.
 In a shift register, rotation is implemented by connecting the serial output
  of the shift register back to its serial input.
 The CIL (Circular Shift Left) and CIR (Circular Shift Right) instructions
  perform circular shifts left and right, respectively.
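
A small C sketch of the operation (generic rotate helpers, not the CIL/CIR instructions of any particular ISA): bits shifted out of one end re-enter at the other, so no information is lost.

```c
#include <stdio.h>
#include <stdint.h>

/* Rotate a 32-bit value left/right by n (0..31); bits wrap around the ends. */
static uint32_t rotl32(uint32_t x, unsigned n) {
    n &= 31;
    return (x << n) | (x >> ((32 - n) & 31));
}
static uint32_t rotr32(uint32_t x, unsigned n) {
    n &= 31;
    return (x >> n) | (x << ((32 - n) & 31));
}

int main(void) {
    uint32_t x = 0x80000001u;
    printf("rotl 1: 0x%08x\n", rotl32(x, 1));   /* 0x00000003 */
    printf("rotr 1: 0x%08x\n", rotr32(x, 1));   /* 0xC0000000 */
    return 0;
}
```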
Computer Architecture
Slide Deck 8: Branch Prediction
Longer Pipeline Frontends Amplify Branch Cost

• Pentium 3: 10-cycle branch penalty
• Pentium 4: 20-cycle branch penalty

[Image from “The Microarchitecture of the Pentium 4 Processor” by Glenn Hinton
et al., Intel Technology Journal Q1, 2001. Image courtesy of Intel.]
Branch Prediction
• Essential in modern processors to mitigate branch delay latencies

Two types of prediction:
1. Predict branch outcome
2. Predict branch/jump address
Where Is the Branch Information Known?

[Figure: pipeline stages F D I X M W. The target address for branches, J and
JAL is known in Decode; the target address for JR and JALR is known after
register read (Issue); the branch outcome itself is known in Execute.]
Branch Delay Slots (expose control hazard to software)

• Change the ISA semantics so that the instruction that follows a jump or
  branch is always executed
  – gives the compiler the flexibility to put a useful instruction where
    normally a pipeline bubble would have resulted.

  I1  096  ADD
      100  BEQZ r1, +200
  I2  104  ADD      <- delay slot instructions executed
      108  ADD         regardless of branch outcome
  I3  304  ADD
Static Branch Prediction
Overall probability a branch is taken is ~60–70%, but:

[Figure: backward branches (e.g., BEZ to an earlier address) are taken ~90% of
the time; forward branches are taken only ~50% of the time.]
Static Hardware Branch Prediction
1. Always Predict Not-Taken
   – What we have been assuming
   – Simple to implement
   – Know the fall-through PC in Fetch
   – Poor accuracy, especially on backward branches
2. Always Predict Taken
   – Difficult to implement because the target is not known until Decode
   – Poor accuracy on if-then-else
3. Backward Branch Taken, Forward Branch Not-Taken
   – Better accuracy
   – Difficult to implement because the target is not known until Decode
Dynamic Hardware Branch Prediction: Exploiting Temporal Correlation
• Exploit structure in the program: the way a branch resolves may be a good
  indicator of the way it will resolve the next time it executes (temporal
  correlation)

1-bit saturating counter:

[State diagram: two states, “Predict T” and “Predict NT”. A taken outcome (T)
moves/keeps the counter in “Predict T”; a not-taken outcome (NT) moves/keeps
it in “Predict NT”.]
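
A minimal C sketch of a per-branch 1-bit predictor (the table size and PC indexing are illustrative assumptions; a 2-bit saturating counter is described in the comment):

```c
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

#define TABLE_BITS 10
#define TABLE_SIZE (1u << TABLE_BITS)

/* One prediction bit per entry: 1 = predict taken, 0 = predict not-taken.
 * (A 2-bit saturating counter would count 0..3 and predict taken when the
 * counter is >= 2, so one wrong outcome does not immediately flip it.) */
static uint8_t table[TABLE_SIZE];

static unsigned index_of(uint32_t pc)        { return (pc >> 2) & (TABLE_SIZE - 1); }
static bool     predict(uint32_t pc)         { return table[index_of(pc)]; }
static void     update(uint32_t pc, bool tk) { table[index_of(pc)] = tk; }

int main(void) {
    uint32_t pc = 0x400100;
    bool outcomes[] = { true, true, true, false, true };   /* a mostly-taken branch */
    int correct = 0;
    for (int i = 0; i < 5; i++) {
        if (predict(pc) == outcomes[i]) correct++;
        update(pc, outcomes[i]);
    }
    printf("%d/5 correct\n", correct);
    return 0;
}
```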
Exploiting Spatial Correlation
Yeh and Patt, 1992

  if (x[i] < 7) then
      y += 1;
  if (x[i] < 5) then
      c -= 4;

If the first condition is false, the second condition is also false.

The Branch History Register (BHR) records the direction (T/NT) of the last N
branches executed by the processor; it is a shift register into which each
branch outcome is shifted.
Pattern History Table (PHT)

[Figure: each branch outcome (T/NT) is shifted into the BHR; the BHR value
indexes the PHT; the selected PHT entry holds a small FSM (e.g., a saturating
counter) whose output logic produces the prediction (T/NT).]
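
A compact C sketch of a two-level predictor in this style: a global BHR indexing a PHT of 2-bit counters. The sizes and the simple indexing are illustrative choices, not the exact Yeh–Patt design.

```c
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

#define HIST_BITS 8
#define PHT_SIZE  (1u << HIST_BITS)

static uint8_t bhr;                 /* last HIST_BITS branch outcomes */
static uint8_t pht[PHT_SIZE];       /* 2-bit saturating counters, 0..3 */

static bool predict(void) {
    return pht[bhr] >= 2;           /* counter >= 2 means predict taken */
}

static void update(bool taken) {
    uint8_t *c = &pht[bhr];
    if (taken  && *c < 3) (*c)++;
    if (!taken && *c > 0) (*c)--;
    bhr = (uint8_t)((bhr << 1) | taken);   /* shift the outcome into the BHR */
}

int main(void) {
    /* Alternating pattern T, NT, T, NT, ...: once the history warms up, the
     * BHR identifies where we are in the pattern and accuracy becomes high. */
    int correct = 0, total = 200;
    for (int i = 0; i < total; i++) {
        bool outcome = (i % 2 == 0);
        if (predict() == outcome) correct++;
        update(outcome);
    }
    printf("%d/%d correct\n", correct, total);
    return 0;
}
```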
Computer Architecture
Slide Deck 9: Advanced Caches
Categorizing Misses: The Three C’s

• Compulsory – first reference to a block; these occur even with an infinite
  cache
• Capacity – the cache is too small to hold all the data needed by the
  program; these occur even under a perfect replacement policy (loop over 5
  cache lines)
• Conflict – misses that occur because of collisions due to less-than-full
  associativity (loop over 3 cache lines)
Reduce Hit Time: Small & Simple Caches

[Plot from Hennessy and Patterson, 4th ed. Image Copyright © 2007–2012
Elsevier Inc. All rights reserved.]
Reduce Miss Rate: Large Block Size

Pros:
• Less tag overhead
• Exploit fast burst transfers from DRAM
• Exploit fast burst transfers over wide on-chip busses
Cons:
• Can waste bandwidth if the data is not used
• Fewer blocks -> more conflicts

[Plot from Hennessy and Patterson, 5th ed. Image Copyright © 2011 Elsevier
Inc. All rights reserved.]
Reduce Miss Rate: Large Cache Size

Empirical rule of thumb:
If the cache size is doubled, the miss rate usually drops by a factor of
about √2.

[Plot from Hennessy and Patterson, 5th ed. Image Copyright © 2011 Elsevier
Inc. All rights reserved.]
Reduce Miss Rate: High Associativity

Empirical rule of thumb:
A direct-mapped cache of size N has about the same miss rate as a two-way
set-associative cache of size N/2.

[Plot from Hennessy and Patterson, 5th ed. Image Copyright © 2011 Elsevier
Inc. All rights reserved.]
Multilevel Caches
Problem: a memory cannot be both large and fast.
Solution: increase the size of the cache at each level.

  CPU — L1$ — L2$ — DRAM

Local miss rate  = misses in this cache / accesses to this cache
Global miss rate = misses in this cache / CPU memory accesses
Misses per instruction = misses in this cache / number of instructions
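
A short worked example of these definitions in C (the miss counts and latencies are made-up numbers for illustration):

```c
#include <stdio.h>

int main(void) {
    /* Assumed workload: 1000 CPU memory accesses. */
    double accesses  = 1000.0;
    double l1_misses = 40.0;      /* L1 local miss rate = 40/1000 = 4%  */
    double l2_misses = 10.0;      /* L2 local miss rate = 10/40   = 25% */

    double l1_local  = l1_misses / accesses;
    double l2_local  = l2_misses / l1_misses;   /* accesses to L2 = L1 misses   */
    double l2_global = l2_misses / accesses;    /* = l1_local * l2_local = 1%   */

    /* Average memory access time with assumed latencies (cycles). */
    double l1_hit = 1.0, l2_hit = 10.0, mem = 100.0;
    double amat = l1_hit + l1_local * (l2_hit + l2_local * mem);   /* = 2.4 */

    printf("L1 local %.1f%%, L2 local %.1f%%, L2 global %.1f%%\n",
           100 * l1_local, 100 * l2_local, 100 * l2_global);
    printf("AMAT = %.2f cycles\n", amat);
    return 0;
}
```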
Presence of L2 influences L1 design
• Use smaller L1 if there is also L2
– Trade increased L1 miss rate for reduced L1 hit time and
reduced L1 miss penalty
– Reduces average access energy
• Use simpler write-through L1 with on-chip L2
– Write-back L2 cache absorbs write traffic, doesn’t go off-chip
– At most one L1 miss request per L1 access (no dirty victim write
back) simplifies pipeline control
– Simplifies coherence issues
– Simplifies error recovery in L1 (can use just parity bits in L1 and
reload from L2 when parity error detected on L1 read)

Victim Cache
• Small fully associative cache for recently evicted lines
  – usually small (4–16 blocks)
• Reduces conflict misses
  – more associativity for a small number of lines
• Can be checked in parallel or in series with the main cache
• On miss in L1, hit in VC: VC -> L1, L1 -> VC
• On miss in L1, miss in VC: L1 -> VC, VC -> ? (can always be clean)

[Figure: CPU/RF with an L1 data cache backed by a unified L2. Lines evicted
from L1 go into the small fully associative victim cache, which supplies hit
data when the access misses in L1.]
Prefetching
• Speculate on future instruction and data accesses and fetch them into the
  cache(s)
  – instruction accesses are easier to predict than data accesses
• Varieties of prefetching
  – hardware prefetching
  – software prefetching
  – mixed schemes

• What types of misses does prefetching affect?
Issues in Prefetching
• Usefulness – should produce hits
• Timeliness – not too late and not too early
• Cache and bandwidth pollution

[Figure: CPU/RF with L1 instruction and L1 data caches backed by a unified L2;
prefetched data is brought into the caches ahead of demand.]
Hardware Instruction Prefetching
Instruction prefetch in the Alpha AXP 21064:
– Fetch two blocks on a miss: the requested block (i) and the next
  consecutive block (i+1)
– The requested block is placed in the cache, and the next block in the
  instruction stream buffer
– On a miss in the cache that hits in the stream buffer, move the stream
  buffer block into the cache and prefetch the next block (i+2)

[Figure: CPU/RF fetching from the L1 instruction cache backed by a unified L2;
the requested block goes into L1 while the prefetched block (i+1) is held in
the stream buffer.]
Hardware Data Prefetching
• Prefetch-on-miss:
  – prefetch block b+1 upon a miss on block b

• One-Block Lookahead (OBL) scheme
  – initiate a prefetch for block b+1 when block b is accessed
  – why is this different from doubling the block size?
  – can extend to N-block lookahead

• Strided prefetch
  – if a sequence of accesses to blocks b, b+N, b+2N is observed, then
    prefetch b+3N, etc.

Example: the IBM POWER5 [2003] supports eight independent streams of strided
prefetch per processor, prefetching 12 lines ahead of the current access.
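
A small C sketch of a single-stream stride prefetcher (the one-entry, table-less design and the confidence threshold are simplifying assumptions): it watches the access addresses, detects a constant stride, and issues a prefetch one stride ahead.

```c
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

/* One-entry stride detector: remembers the last address and last stride, and
 * prefetches addr + stride once the same stride has been seen twice. */
typedef struct {
    uint64_t last_addr;
    int64_t  last_stride;
    int      confidence;
    bool     valid;
} StridePrefetcher;

static void observe(StridePrefetcher *p, uint64_t addr) {
    if (p->valid) {
        int64_t stride = (int64_t)(addr - p->last_addr);
        if (stride == p->last_stride && stride != 0) {
            if (++p->confidence >= 2)
                printf("prefetch 0x%llx\n", (unsigned long long)(addr + stride));
        } else {
            p->confidence = 0;
        }
        p->last_stride = stride;
    }
    p->last_addr = addr;
    p->valid = true;
}

int main(void) {
    StridePrefetcher p = {0};
    uint64_t base = 0x10000;
    for (int i = 0; i < 6; i++)
        observe(&p, base + (uint64_t)i * 256);   /* accesses to b, b+N, b+2N, ... */
    return 0;
}
```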
Banked Caches
• Partition the address space into multiple banks
  – use portions of the address (low-order or high-order interleaved)

Benefits:
• Higher throughput
Challenges:
• Bank conflicts
• Extra wiring
• Uneven utilization

[Figure: two banks; address 0 / data 0 go to Bank 0, address 1 / data 1 go to
Bank 1.]
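
A tiny C sketch of low-order-interleaved bank selection (the block size and bank count are example parameters): the bank index comes from the address bits just above the block offset, so consecutive blocks land in different banks.

```c
#include <stdio.h>
#include <stdint.h>

#define BLOCK_BYTES 64
#define NUM_BANKS   4

/* Low-order interleaving: bank index = block number mod NUM_BANKS. */
static unsigned bank_of(uint64_t addr) {
    return (unsigned)((addr / BLOCK_BYTES) % NUM_BANKS);
}

int main(void) {
    for (uint64_t a = 0; a < 6 * BLOCK_BYTES; a += BLOCK_BYTES)
        printf("address 0x%03llx -> bank %u\n",
               (unsigned long long)a, bank_of(a));   /* banks 0,1,2,3,0,1 */
    return 0;
}
```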
Compiler Optimizations
• Restructuring code affects the data block access sequence
  – group data accesses together to improve spatial locality
  – re-order data accesses to improve temporal locality
• Prevent data from entering the cache
  – useful for variables that will only be accessed once before being replaced
  – needs a mechanism for software to tell hardware not to cache data
    (“no-allocate” instruction hints or page table bits)
• Kill data that will never be used again
  – streaming data exploits spatial locality but not temporal locality
  – replace into dead cache locations
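
A standard C illustration of the first bullet (loop interchange, not taken from the slides): traversing a row-major array column-by-column touches a new cache block on nearly every access, while the interchanged loop walks memory sequentially and exploits spatial locality.

```c
#include <stdio.h>

#define R 512
#define C 512

static double a[R][C];

/* Column-major traversal of a row-major array: consecutive accesses are
 * C*sizeof(double) bytes apart, so spatial locality is poor. */
static double sum_bad(void) {
    double s = 0.0;
    for (int j = 0; j < C; j++)
        for (int i = 0; i < R; i++)
            s += a[i][j];
    return s;
}

/* After loop interchange: consecutive accesses are adjacent in memory. */
static double sum_good(void) {
    double s = 0.0;
    for (int i = 0; i < R; i++)
        for (int j = 0; j < C; j++)
            s += a[i][j];
    return s;
}

int main(void) {
    for (int i = 0; i < R; i++)
        for (int j = 0; j < C; j++)
            a[i][j] = 1.0;
    printf("%.0f %.0f\n", sum_bad(), sum_good());   /* same result, different locality */
    return 0;
}
```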
Computer Architecture
ELE 475 / COS 475
Slide Deck 10: Address Translation and Protection
Dynamic Address Translation

• Location-independent programs
  – programming and storage management ease
  => need for a base register
• Protection
  – independent programs should not affect each other inadvertently
  => need for a bound register
• Multiprogramming drives the requirement for a resident supervisor (OS) to
  manage context switches between multiple programs

[Figure: physical memory holding the OS plus prog1 and prog2 at different base
addresses.]
Simple Base and Bound Translation

[Figure: the effective address from a Load is added to the Base Register to
form the physical address into the current segment of physical memory; the
effective address is also compared against the Bound Register (segment length)
to detect a bounds violation.]

Base and bound registers are visible/accessible only when the processor is
running in supervisor mode.
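
A minimal C sketch of the translation and the bounds check (register names follow the slide; the fault handling here is just a printout):

```c
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

typedef struct {
    uint32_t base;    /* start of the segment in physical memory */
    uint32_t bound;   /* segment length */
} SegmentRegs;

/* Translate an effective address; returns false on a bounds violation. */
static bool translate(const SegmentRegs *seg, uint32_t ea, uint32_t *pa) {
    if (ea >= seg->bound)
        return false;              /* bounds violation -> raise an exception */
    *pa = seg->base + ea;
    return true;
}

int main(void) {
    SegmentRegs seg = { .base = 0x40000, .bound = 0x10000 };   /* 64 KiB segment */
    uint32_t pa;
    uint32_t tests[] = { 0xFF00, 0x10000 };
    for (int i = 0; i < 2; i++) {
        if (translate(&seg, tests[i], &pa))
            printf("EA 0x%05x -> PA 0x%05x\n", tests[i], pa);
        else
            printf("EA 0x%05x -> bounds violation\n", tests[i]);
    }
    return 0;
}
```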
Separate Areas for Program and Data

[Figure: two base-and-bound register pairs. The effective address from a Load
is checked against the Data Bound Register and added to the Data Base Register
to reach the data segment in main memory; the Program Counter is checked
against the Program Bound Register and added to the Program Base Register to
reach the program segment.]

What is an advantage of this separation?
(Scheme used on all Cray vector supercomputers prior to the X1, 2002.)
Memory Fragmentation

[Figure: as users arrive and leave, physical memory alternates between
allocated regions (user 1: 16K, user 2: 24K, user 3: 32K, user 4: 16K,
user 5: 24K) and free holes of various sizes.]

As users come and go, the storage is “fragmented”. Therefore, at some stage
programs have to be moved around to compact the storage.
Paged Memory Systems
• A processor-generated address can be interpreted as a pair
  <page number, offset>
• A page table contains the physical address of the base of each page

[Figure: the address space of User-1 (pages 0–3) is mapped through User-1’s
page table to physical memory; the pages need not be contiguous or in order.]

Page tables make it possible to store the pages of a program non-contiguously.
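
A small C sketch of this translation (4 KiB pages and a single-level table are assumptions for the example): the virtual address is split into <page number, offset>, the page number indexes the page table, and the physical frame base is concatenated with the offset.

```c
#include <stdio.h>
#include <stdint.h>

#define PAGE_BITS  12                    /* 4 KiB pages */
#define PAGE_SIZE  (1u << PAGE_BITS)
#define NUM_PAGES  4

/* Page table for one user: physical frame number for each virtual page. */
static uint32_t page_table[NUM_PAGES] = { 7, 3, 12, 5 };

static uint32_t translate(uint32_t va) {
    uint32_t vpn    = va >> PAGE_BITS;          /* page number */
    uint32_t offset = va & (PAGE_SIZE - 1);     /* offset within the page */
    return (page_table[vpn] << PAGE_BITS) | offset;
}

int main(void) {
    uint32_t va = (2u << PAGE_BITS) + 0x34;     /* page 2, offset 0x34 */
    printf("VA 0x%05x -> PA 0x%05x\n", va, translate(va));   /* frame 12 -> 0x0C034 */
    return 0;
}
```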
Private Address Space per User

[Figure: Users 1, 2 and 3 each have their own page table mapping their virtual
address VA1 into physical memory, which also holds the OS pages and free
frames.]

• Each user has their own page table
• The page table contains an entry for each user page
Where Should Page Tables Reside?
• The space required by the page tables (PTs) is proportional to the address
  space, the number of users, (and inversely to) the size of each page, …
  – the space requirement is large
  – too expensive to keep in registers
• Idea: keep the PTs in main memory
  – needs one reference to retrieve the page base address and another to
    access the data word
    • doubles the number of memory references!
  – storage space to store the PTs grows with the size of memory
This slide is not in the Final Exam
