Computer Architecture Revision For Final Exam


Revision Slides for Computer Architecture

- Maaruf Ali

Computer Architecture Slide Deck 5: Superscalar 2 and Exceptions

Causes of Exceptions
Interrupt: an event that requests the attention of the processor

• Asynchronous: an external event
  – input/output device service request
  – timer expiration
  – power disruptions, hardware failure
• Synchronous: an internal exception (a.k.a. exception/trap)
  – undefined opcode, privileged instruction
  – arithmetic overflow, FPU exception
  – misaligned memory access
  – virtual memory exceptions: page faults, TLB misses, protection violations
  – software exceptions, e.g., system calls
Asynchronous Interrupts: Invoking the Interrupt Handler

• An I/O device requests attention by asserting one of the prioritized
  interrupt request lines

• When the processor decides to process the interrupt:
  – It stops the current program at instruction Ii, completing all the
    instructions up to Ii-1 (a precise interrupt)
  – It saves the PC of instruction Ii in a special register (EPC)
  – It disables interrupts and transfers control to a designated interrupt
    handler running in kernel mode
Interrupt Handler
• Saves EPC before re-enabling interrupts to allow nested interrupts
  – need an instruction to move EPC into the GPRs
  – need a way to mask further interrupts at least until EPC can be saved
• Needs to read a status register that indicates the cause of the interrupt
• Uses a special indirect jump instruction RFE (return-from-exception) to
  resume user code; this:
  – enables interrupts
  – restores the processor to user mode
  – restores hardware status and control state
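
A minimal C sketch of this sequence, simulating the architectural state rather than using any real ISA (the struct fields, cause codes and handler address below are assumptions for illustration only):

```c
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical machine state, for illustration only. */
typedef struct {
    uint32_t pc, epc, cause;     /* program counter, exception PC, cause register */
    bool     interrupts_enabled;
    bool     kernel_mode;
} Cpu;

/* Done by "hardware" on an asynchronous interrupt: save PC into EPC,
 * mask interrupts, enter kernel mode, jump to the handler. */
static void take_interrupt(Cpu *cpu, uint32_t cause, uint32_t handler_pc) {
    cpu->epc = cpu->pc;                 /* PC of interrupted instruction Ii */
    cpu->cause = cause;
    cpu->interrupts_enabled = false;    /* mask further interrupts */
    cpu->kernel_mode = true;
    cpu->pc = handler_pc;
}

/* The handler: save EPC away before re-enabling interrupts (so a nested
 * interrupt cannot clobber it), read the cause, then "RFE" back. */
static void interrupt_handler(Cpu *cpu) {
    uint32_t saved_epc = cpu->epc;      /* "move EPC into a GPR" */
    cpu->interrupts_enabled = true;     /* nested interrupts are now safe */
    printf("handling cause %u, will return to 0x%x\n", cpu->cause, saved_epc);

    /* RFE: restore user mode and resume the interrupted program. */
    cpu->kernel_mode = false;
    cpu->pc = saved_epc;
}

int main(void) {
    Cpu cpu = { .pc = 0x100, .interrupts_enabled = true, .kernel_mode = false };
    take_interrupt(&cpu, 7 /* e.g. timer */, 0x8000);
    interrupt_handler(&cpu);
    printf("resumed at 0x%x, kernel_mode=%d\n", cpu.pc, cpu.kernel_mode);
    return 0;
}
```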
Synchronous Interrupts
• A synchronous interrupt (exception) is caused by a particular instruction

• In general, the instruction cannot be completed and needs to be restarted
  after the exception has been handled
  – requires undoing the effect of one or more partially executed instructions

• In the case of a system call trap, the instruction is considered to have
  been completed
  – syscall is a special jump instruction involving a change to privileged
    kernel mode
  – the handler resumes at the instruction after the system call
Exception Handling in the 5-Stage Pipeline

[Figure: 5-stage pipeline (PC/Fetch, Decode, Execute, Memory, Writeback) with
the exceptions that can arise in each stage: PC address exception in Fetch,
illegal opcode in Decode, arithmetic overflow in Execute, data address
exceptions in Memory, plus asynchronous interrupts arriving from outside.]

• How to handle multiple simultaneous exceptions in different pipeline stages?
• How and where to handle external asynchronous interrupts?
Exception Handling in the 5-Stage Pipeline (mechanism)

[Figure: the same 5-stage pipeline with exception-handling hardware added.
Each stage latches its exception flag and PC (Exc D/E/M, PC D/E/M) and carries
them down the pipeline to a commit point at the Memory stage. At commit, the
oldest exception is selected: its PC is written into EPC, its cause into the
Cause register, the handler PC is selected as the next fetch address, and the
instructions in the Fetch, Decode and Execute stages are killed, along with
the writeback of the faulting instruction. Asynchronous interrupts are
injected at the commit point.]
Exception Pipeline Diagram

                  time   t0   t1   t2   t3   t4   t5   t6   t7  ...
(I1) 096: ADD            IF1  ID1  EX1  MA1  nop                       <- overflow!
(I2) 100: XOR                 IF2  ID2  EX2  nop  nop
(I3) 104: SUB                      IF3  ID3  nop  nop  nop
(I4) 108: ADD                           IF4  nop  nop  nop  nop
(I5) Exc. handler code                       IF5  ID5  EX5  MA5  WB5

Resource usage:
                  time   t0   t1   t2   t3   t4   t5   t6   t7  ...
                  IF     I1   I2   I3   I4   nop  I5
                  ID          I1   I2   I3   nop  nop  I5
                  EX               I1   I2   nop  nop  nop  I5
                  MA                    I1   nop  nop  nop  nop  I5
                  WB
Out-Of-Order (OOO) Introduction

Name   Frontend  Issue  Writeback  Commit  Structures needed
I4     IO        IO     IO         IO      Fixed-length pipeline, Scoreboard
I2O2   IO        IO     OOO        OOO     Scoreboard
I2OI   IO        IO     OOO        IO      Scoreboard, Reorder Buffer, and Store Buffer
IO3    IO        OOO    OOO        OOO     Scoreboard and Issue Queue
IO2I   IO        OOO    OOO        IO      Scoreboard, Issue Queue, Reorder Buffer, and Store Buffer

 Frontend = Instruction Fetch and Decode stages
 IO = In Order
 OOO = Out Of Order
 Issue = all the operands are ready for execution
 Writeback = write to memory/register
 Commit = the instruction can no longer be rolled back and must be completed
 IO2I = In-order frontend, Out-of-order issue, Out-of-order writeback, In-order commit
 Scoreboard – a structure that tracks which instructions are ready to execute
 Reorder Buffer – when instructions execute out of order, this is where they
  are re-ordered so that they commit in order; dependences are resolved so
  that we never commit out of order
OOO Motivating Code Sequence

0 MUL   R1, R2, R3
1 ADDIU R11,R10,1
2 MUL   R5, R1, R4
3 MUL   R7, R5, R6
4 ADDIU R12,R11,1
5 ADDIU R13,R12,1
6 ADDIU R14,R12,2

[Dependence graph: 0 -> 2 -> 3 (the MUL chain) and 1 -> 4 -> 5, 4 -> 6 (the
ADDIU chain); the two chains are independent of each other.]

 Two independent sequences of instructions enable flexibility in how the
  instructions are scheduled in total order
 We can schedule statically in software or dynamically in hardware
 Instruction-level parallelism: multiple instructions can be executed in the
  same cycle when there is no data dependence between them
Computer Architecture
ELE 475 / COS 475
Slide Deck 6: Superscalar 3
Register Renaming
• Adding more “names” (registers/memory) removes dependences, but the
  architectural namespace is limited.
  – Registers: a larger namespace requires more bits in the instruction
    encoding. 32 registers = 5 bits, 128 registers = 7 bits.

• Register Renaming: change the naming of registers in hardware to eliminate
  WAW and WAR hazards
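
A minimal C sketch of the idea (the map table and simple free-register counter are illustrative assumptions, not a real rename design): each destination gets a fresh physical register, so a later write to the same architectural register no longer conflicts with earlier readers or writers.

```c
#include <stdio.h>

#define NUM_ARCH_REGS 32

/* Rename table: maps each architectural register to the physical register
 * holding its latest value. Naive allocation, no free-list recycling. */
static int map_table[NUM_ARCH_REGS];
static int next_free = NUM_ARCH_REGS;

static int rename_src(int arch) { return map_table[arch]; }
static int rename_dst(int arch) { map_table[arch] = next_free++; return map_table[arch]; }

int main(void) {
    for (int r = 0; r < NUM_ARCH_REGS; r++) map_table[r] = r;

    /* MUL R1, R2, R3 followed by MUL R1, R4, R5: a WAW hazard on R1. */
    int s1 = rename_src(2), s2 = rename_src(3), d1 = rename_dst(1);
    int s3 = rename_src(4), s4 = rename_src(5), d2 = rename_dst(1);

    printf("MUL P%d, P%d, P%d\n", d1, s1, s2);   /* writes P32 */
    printf("MUL P%d, P%d, P%d\n", d2, s3, s4);   /* writes P33: no WAW hazard */
    return 0;
}
```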
In-Order Memory Queue
• Execute all loads and stores in program order
  => a load or store cannot leave the issue queue for execution until all
     previous loads and stores have completed execution

• Can still execute loads and stores speculatively, and out-of-order with
  respect to other (non-memory) instructions

• Need a structure to handle memory ordering…
Address Speculation
  ST R1, 0(R2)
  LD R3, 0(R4)

• Guess that R4 != R2
• Execute the load before the store address is known
• Need to hold all completed but uncommitted load/store addresses in program
  order
• If we subsequently find R4 == R2, squash the load and all following
  instructions
  => large penalty for inaccurate address speculation
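
A small C sketch of the squash check under assumed data structures (a list of uncommitted memory operations kept in program order): when a store's address resolves, any younger load that already executed to the same address got stale data and must be squashed, along with everything after it.

```c
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

#define MAXOPS 16

/* One entry per in-flight memory operation, held in program order. */
typedef struct {
    bool     is_load;
    bool     executed;   /* load already performed speculatively */
    uint32_t addr;       /* resolved address */
} MemOp;

/* When store `s` resolves its address, return the index of the first younger
 * executed load to the same address (point from which to squash), or -1. */
static int find_squash_point(const MemOp *q, int nops, int s, uint32_t store_addr) {
    for (int i = s + 1; i < nops; i++)
        if (q[i].is_load && q[i].executed && q[i].addr == store_addr)
            return i;
    return -1;
}

int main(void) {
    MemOp q[MAXOPS] = {
        { .is_load = false, .executed = false, .addr = 0 },      /* ST R1, 0(R2) */
        { .is_load = true,  .executed = true,  .addr = 0x1000 }, /* LD R3, 0(R4) ran early */
    };
    int squash = find_squash_point(q, 2, 0, 0x1000);  /* store resolves to 0x1000 */
    if (squash >= 0) printf("squash from op %d onward\n", squash);
    else             printf("speculation was correct\n");
    return 0;
}
```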
Memory Dependence Prediction (Alpha 21264)
  ST R1, (R2)
  LD R3, (R4)

• Guess that R4 != R2 and execute the load before the store
• If we later find R4 == R2, squash the load and all following instructions,
  but mark the load instruction as store-wait
• Subsequent executions of the same load instruction will wait for all
  previous stores to complete
• Periodically clear the store-wait bits
Computer Architecture
ELE 475 / COS 475
Slide Deck 7: VLIW
David Wentzlaff
Department of Electrical Engineering
Princeton University
VLIW: Very Long Instruction Word

[Instruction format: | Int Op 1 | Int Op 2 | Mem Op 1 | Mem Op 2 | FP Op 1 | FP Op 2 |
feeding two integer units (single-cycle latency), two load/store units
(three-cycle latency) and two floating-point units (four-cycle latency).]

• Multiple operations packed into one instruction
• Each operation slot is for a fixed function
• Constant operation latencies are specified
• Architecture requires guarantee of:
  – parallelism within an instruction => no cross-operation RAW check
  – no data use before data ready => no data interlocks
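
A toy C sketch of the bundle idea (slot layout and opcodes invented for illustration): every slot is a fixed-function operation, and the whole bundle issues together with no cross-slot hazard checks, because the compiler is assumed to have guaranteed independence.

```c
#include <stdio.h>
#include <stdint.h>

/* Hypothetical 6-slot bundle: slots 0-1 integer, 2-3 memory, 4-5 floating point. */
typedef struct { uint8_t op, dst, src1, src2; } Slot;
typedef struct { Slot slot[6]; } Bundle;

enum { NOP = 0, ADD = 1 };

static int32_t regs[32];

/* Issue one bundle: all slots read their sources first, then all write.
 * No interlocks -- correctness relies entirely on the compiler's schedule. */
static void issue(const Bundle *b) {
    int32_t result[6];
    for (int i = 0; i < 6; i++)                                   /* read phase  */
        result[i] = (b->slot[i].op == ADD)
                  ? regs[b->slot[i].src1] + regs[b->slot[i].src2]
                  : 0;
    for (int i = 0; i < 6; i++)                                   /* write phase */
        if (b->slot[i].op != NOP) regs[b->slot[i].dst] = result[i];
}

int main(void) {
    regs[2] = 5; regs[3] = 7;
    /* Two independent integer ADDs in one bundle; remaining slots are NOPs. */
    Bundle b = { .slot = { {ADD, 1, 2, 3}, {ADD, 4, 2, 2} } };
    issue(&b);
    printf("R1=%d R4=%d\n", regs[1], regs[4]);   /* prints R1=12 R4=10 */
    return 0;
}
```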
Software Pipelining vs. Loop Unrolling

[Figure: performance over time for one loop. The unrolled loop pays a startup
overhead and a wind-down overhead within every iteration of the unrolled body;
the software-pipelined loop pays those costs only once for the whole loop.]

Software pipelining pays startup/wind-down costs only once per loop,
not once per iteration
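
A hedged C illustration of the two transformations on a simple a[i] = b[i] * c loop (hand-written sketch, not compiler output): unrolling replicates the body to expose more work per iteration, while the software-pipelined version overlaps the load of iteration i+1 with the compute of iteration i, with an explicit prologue and epilogue.

```c
#include <stdio.h>

#define N 8

/* Unrolled by 4: more independent work per iteration, but the replicated body
 * still starts and finishes its "pipeline" inside every iteration. */
static void unrolled(int *a, const int *b, int c) {
    for (int i = 0; i < N; i += 4) {
        a[i]   = b[i]   * c;
        a[i+1] = b[i+1] * c;
        a[i+2] = b[i+2] * c;
        a[i+3] = b[i+3] * c;
    }
}

/* Software-pipelined (one stage deep): the load for iteration i+1 is issued
 * while the multiply/store of iteration i completes. The prologue fills the
 * pipeline once; the epilogue drains it once. */
static void sw_pipelined(int *a, const int *b, int c) {
    int loaded = b[0];                 /* prologue: first load */
    for (int i = 0; i < N - 1; i++) {
        int next = b[i + 1];           /* load for the next iteration */
        a[i] = loaded * c;             /* compute/store for this iteration */
        loaded = next;
    }
    a[N - 1] = loaded * c;             /* epilogue: drain the last value */
}

int main(void) {
    int b[N] = {1, 2, 3, 4, 5, 6, 7, 8}, a1[N], a2[N];
    unrolled(a1, b, 3);
    sw_pipelined(a2, b, 3);
    for (int i = 0; i < N; i++) printf("%d %d\n", a1[i], a2[i]);   /* identical results */
    return 0;
}
```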
Trace Scheduling [Fisher, Ellis]
• Pick a string of basic blocks, a trace, that represents the most frequent
  branch path
• Use profiling feedback or compiler heuristics to find common branch paths
• Schedule the whole “trace” at once
• Add fixup code to cope with branches jumping out of the trace
Problems with “Classic” VLIW
• Object-code compatibility
  – have to recompile all code for every machine, even for two machines in the
    same generation
• Object code size
  – instruction padding wastes instruction memory/cache
  – loop unrolling/software pipelining replicates code
• Scheduling variable-latency memory operations
  – caches and/or memory bank conflicts impose statically unpredictable
    variability
• Knowing branch probabilities
  – profiling requires a significant extra step in the build process
• Scheduling for statically unpredictable branches
  – the optimal schedule varies with the branch path
• Precise interrupts can be challenging
  – does a fault in one portion of a bundle fault the whole bundle?
  – the EQ model has problems with single-stepping, etc.
Code Motion

Before Code Motion          After Code Motion
MUL   R1, R2, R3            LW    R14, 0(R9)
ADDIU R11,R10,1             ADDIU R11,R10,1
MUL   R5, R1, R4            MUL   R1, R2, R3
MUL   R7, R5, R6            ADDIU R12,R11,1
SW    R7, 0(R16)            MUL   R5, R1, R4
ADDIU R12,R11,1             ADD   R13,R12,R14
LW    R14, 0(R9)            MUL   R7, R5, R6
ADD   R13,R12,R14           ADD   R14,R12,R13
ADD   R14,R12,R13           SW    R7, 0(R16)
BNEQ  R16, target           BNEQ  R16, target
Scheduling and Bundling

Before Bundling             After Bundling
LW    R14, 0(R9)            {LW    R14, 0(R9)
ADDIU R11,R10,1              ADDIU R11,R10,1
MUL   R1, R2, R3             MUL   R1, R2, R3}
ADDIU R12,R11,1             {ADDIU R12,R11,1
MUL   R5, R1, R4             MUL   R5, R1, R4}
ADD   R13,R12,R14           {ADD   R13,R12,R14
MUL   R7, R5, R6             MUL   R7, R5, R6}
ADD   R14,R12,R13           {ADD   R14,R12,R13
SW    R7, 0(R16)             SW    R7, 0(R16)
BNEQ  R16, target            BNEQ  R16, target}
Prologue
• In computer architecture, the terms “prologue” and “epilogue” refer to
  specific sections of code within a function.
1. The function prologue is a set of instructions that appears at the
   beginning of a function.
2. Its purpose is to prepare the stack and registers for use within the
   function.
3. Key actions performed by the prologue include:
   – saving any registers that the function might use (registers that the
     platform’s calling convention requires to be preserved across function
     calls);
   – setting up the stack to allocate space for local variables;
   – establishing a base pointer (or frame pointer) as a stable reference
     point into the current stack frame.
4. The prologue ensures that the function has a clean slate to work with.
Epilogue
 The function epilogue appears at the end of a function.
 Its purpose is to restore the stack and registers to the state they were in
  before the function was called.
 Key actions performed by the epilogue include:
  – dropping the stack pointer back to the current base pointer, freeing the
    room reserved for local variables;
  – popping the base pointer off the stack, restoring it to its value before
    the prologue;
  – returning control to the calling function by popping the previous frame’s
    program counter off the stack and jumping to it.
 Essentially, the epilogue cleans up after the function execution.
The Need for Prologue and Epilogue
 These prologue and epilogue sections are essential for managing the
  function’s context and ensuring proper execution within the broader program.
 They are conventions used by assembly-language programmers and by compilers
  of higher-level languages.
Register Rotation
 In this context, register rotation refers to circularly shifting the bits of
  a register around its two ends, without any loss of data or contents.
 In a shift register, rotation is implemented by connecting the serial output
  of the shift register back to its serial input.
 The CIL (Circular Shift Left) and CIR (Circular Shift Right) instructions
  perform circular shifts left and right, respectively.
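
A small C sketch of the operation (generic rotate helpers, not the CIL/CIR instructions of any particular ISA): bits shifted out of one end re-enter at the other, so no information is lost.

```c
#include <stdio.h>
#include <stdint.h>

/* Rotate a 32-bit value left/right by n (0..31); bits wrap around the ends. */
static uint32_t rotl32(uint32_t x, unsigned n) {
    n &= 31;
    return (x << n) | (x >> ((32 - n) & 31));
}
static uint32_t rotr32(uint32_t x, unsigned n) {
    n &= 31;
    return (x >> n) | (x << ((32 - n) & 31));
}

int main(void) {
    uint32_t x = 0x80000001u;
    printf("rotl 1: 0x%08x\n", rotl32(x, 1));   /* 0x00000003 */
    printf("rotr 1: 0x%08x\n", rotr32(x, 1));   /* 0xC0000000 */
    return 0;
}
```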
Computer Architecture
Slide Deck 8: Branch Prediction
Longer Pipeline Frontends Amplify Branch Cost

• Pentium 3: 10-cycle branch penalty
• Pentium 4: 20-cycle branch penalty

[Image from “The Microarchitecture of the Pentium 4 Processor” by Glenn Hinton
et al., Intel Technology Journal Q1, 2001. Image courtesy of Intel.]
Branch Prediction
• Essential in modern processors to mitigate branch delay latencies

Two types of prediction:
1. Predict branch outcome
2. Predict branch/jump address
Where Is the Branch Information Known?

[Figure: pipeline stages F D I X M W. The target address for branches, J and
JAL is known in Decode; the target address for JR and JALR is known after
register read (Issue); the branch outcome itself is known in Execute.]
Branch Delay Slots (expose control hazard to software)

• Change the ISA semantics so that the instruction that follows a jump or
  branch is always executed
  – gives the compiler the flexibility to put a useful instruction where
    normally a pipeline bubble would have resulted.

  I1  096  ADD
      100  BEQZ r1, +200
  I2  104  ADD      <- delay slot instructions executed
      108  ADD         regardless of branch outcome
  I3  304  ADD
Static Branch Prediction
Overall probability a branch is taken is ~60–70%, but:

[Figure: backward branches (e.g., BEZ to an earlier address) are taken ~90% of
the time; forward branches are taken only ~50% of the time.]
Static Hardware Branch Prediction
1. Always Predict Not-Taken
   – What we have been assuming
   – Simple to implement
   – Know the fall-through PC in Fetch
   – Poor accuracy, especially on backward branches
2. Always Predict Taken
   – Difficult to implement because the target is not known until Decode
   – Poor accuracy on if-then-else
3. Backward Branch Taken, Forward Branch Not-Taken
   – Better accuracy
   – Difficult to implement because the target is not known until Decode
Dynamic Hardware Branch Prediction: Exploiting Temporal Correlation
• Exploit structure in the program: the way a branch resolves may be a good
  indicator of the way it will resolve the next time it executes (temporal
  correlation)

1-bit saturating counter:

[State diagram: two states, “Predict T” and “Predict NT”. A taken outcome (T)
moves/keeps the counter in “Predict T”; a not-taken outcome (NT) moves/keeps
it in “Predict NT”.]
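
A minimal C sketch of a per-branch 1-bit predictor (the table size and PC indexing are illustrative assumptions; a 2-bit saturating counter is described in the comment):

```c
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

#define TABLE_BITS 10
#define TABLE_SIZE (1u << TABLE_BITS)

/* One prediction bit per entry: 1 = predict taken, 0 = predict not-taken.
 * (A 2-bit saturating counter would count 0..3 and predict taken when the
 * counter is >= 2, so one wrong outcome does not immediately flip it.) */
static uint8_t table[TABLE_SIZE];

static unsigned index_of(uint32_t pc)        { return (pc >> 2) & (TABLE_SIZE - 1); }
static bool     predict(uint32_t pc)         { return table[index_of(pc)]; }
static void     update(uint32_t pc, bool tk) { table[index_of(pc)] = tk; }

int main(void) {
    uint32_t pc = 0x400100;
    bool outcomes[] = { true, true, true, false, true };   /* a mostly-taken branch */
    int correct = 0;
    for (int i = 0; i < 5; i++) {
        if (predict(pc) == outcomes[i]) correct++;
        update(pc, outcomes[i]);
    }
    printf("%d/5 correct\n", correct);
    return 0;
}
```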
Exploiting Spatial Correlation
Yeh and Patt, 1992

  if (x[i] < 7) then
      y += 1;
  if (x[i] < 5) then
      c -= 4;

If the first condition is false, the second condition is also false.

The Branch History Register (BHR) records the direction (T/NT) of the last N
branches executed by the processor; it is a shift register into which each
branch outcome is shifted.
Pattern History Table (PHT)

[Figure: each branch outcome (T/NT) is shifted into the BHR; the BHR value
indexes the PHT; the selected PHT entry holds a small FSM (e.g., a saturating
counter) whose output logic produces the prediction (T/NT).]
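
A compact C sketch of a two-level predictor in this style: a global BHR indexing a PHT of 2-bit counters. The sizes and the simple indexing are illustrative choices, not the exact Yeh–Patt design.

```c
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

#define HIST_BITS 8
#define PHT_SIZE  (1u << HIST_BITS)

static uint8_t bhr;                 /* last HIST_BITS branch outcomes */
static uint8_t pht[PHT_SIZE];       /* 2-bit saturating counters, 0..3 */

static bool predict(void) {
    return pht[bhr] >= 2;           /* counter >= 2 means predict taken */
}

static void update(bool taken) {
    uint8_t *c = &pht[bhr];
    if (taken  && *c < 3) (*c)++;
    if (!taken && *c > 0) (*c)--;
    bhr = (uint8_t)((bhr << 1) | taken);   /* shift the outcome into the BHR */
}

int main(void) {
    /* Alternating pattern T, NT, T, NT, ...: once the history warms up, the
     * BHR identifies where we are in the pattern and accuracy becomes high. */
    int correct = 0, total = 200;
    for (int i = 0; i < total; i++) {
        bool outcome = (i % 2 == 0);
        if (predict() == outcome) correct++;
        update(outcome);
    }
    printf("%d/%d correct\n", correct, total);
    return 0;
}
```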
Computer Architecture
Slide Deck 9: Advanced Caches
Categorizing Misses: The Three C’s

• Compulsory – first reference to a block; these occur even with an infinite
  cache
• Capacity – the cache is too small to hold all the data needed by the
  program; these occur even under a perfect replacement policy (loop over 5
  cache lines)
• Conflict – misses that occur because of collisions due to less-than-full
  associativity (loop over 3 cache lines)
Reduce Hit Time: Small & Simple Caches

[Plot from Hennessy and Patterson, 4th ed. Image Copyright © 2007–2012
Elsevier Inc. All rights reserved.]
Reduce Miss Rate: Large Block Size

Pros:
• Less tag overhead
• Exploit fast burst transfers from DRAM
• Exploit fast burst transfers over wide on-chip busses
Cons:
• Can waste bandwidth if the data is not used
• Fewer blocks -> more conflicts

[Plot from Hennessy and Patterson, 5th ed. Image Copyright © 2011 Elsevier
Inc. All rights reserved.]
Reduce Miss Rate: Large Cache Size

Empirical rule of thumb:
If the cache size is doubled, the miss rate usually drops by a factor of
about √2.

[Plot from Hennessy and Patterson, 5th ed. Image Copyright © 2011 Elsevier
Inc. All rights reserved.]
Reduce Miss Rate: High Associativity

Empirical rule of thumb:
A direct-mapped cache of size N has about the same miss rate as a two-way
set-associative cache of size N/2.

[Plot from Hennessy and Patterson, 5th ed. Image Copyright © 2011 Elsevier
Inc. All rights reserved.]
Multilevel Caches
Problem: a memory cannot be both large and fast.
Solution: increase the size of the cache at each level.

  CPU — L1$ — L2$ — DRAM

Local miss rate  = misses in this cache / accesses to this cache
Global miss rate = misses in this cache / CPU memory accesses
Misses per instruction = misses in this cache / number of instructions
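
A short worked example of these definitions in C (the miss counts and latencies are made-up numbers for illustration):

```c
#include <stdio.h>

int main(void) {
    /* Assumed workload: 1000 CPU memory accesses. */
    double accesses  = 1000.0;
    double l1_misses = 40.0;      /* L1 local miss rate = 40/1000 = 4%  */
    double l2_misses = 10.0;      /* L2 local miss rate = 10/40   = 25% */

    double l1_local  = l1_misses / accesses;
    double l2_local  = l2_misses / l1_misses;   /* accesses to L2 = L1 misses   */
    double l2_global = l2_misses / accesses;    /* = l1_local * l2_local = 1%   */

    /* Average memory access time with assumed latencies (cycles). */
    double l1_hit = 1.0, l2_hit = 10.0, mem = 100.0;
    double amat = l1_hit + l1_local * (l2_hit + l2_local * mem);   /* = 2.4 */

    printf("L1 local %.1f%%, L2 local %.1f%%, L2 global %.1f%%\n",
           100 * l1_local, 100 * l2_local, 100 * l2_global);
    printf("AMAT = %.2f cycles\n", amat);
    return 0;
}
```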
Presence of L2 influences L1 design
• Use smaller L1 if there is also L2
– Trade increased L1 miss rate for reduced L1 hit time and
reduced L1 miss penalty
– Reduces average access energy
• Use simpler write-through L1 with on-chip L2
– Write-back L2 cache absorbs write traffic, doesn’t go off-chip
– At most one L1 miss request per L1 access (no dirty victim write
back) simplifies pipeline control
– Simplifies coherence issues
– Simplifies error recovery in L1 (can use just parity bits in L1 and
reload from L2 when parity error detected on L1 read)

Victim Cache
• Small fully associative cache for recently evicted lines
  – usually small (4–16 blocks)
• Reduces conflict misses
  – more associativity for a small number of lines
• Can be checked in parallel or in series with the main cache
• On miss in L1, hit in VC: VC -> L1, L1 -> VC
• On miss in L1, miss in VC: L1 -> VC, VC -> ? (can always be clean)

[Figure: CPU/RF with an L1 data cache backed by a unified L2. Lines evicted
from L1 go into the small fully associative victim cache, which supplies hit
data when the access misses in L1.]
Prefetching
• Speculate on future instruction and data accesses and fetch them into the
  cache(s)
  – instruction accesses are easier to predict than data accesses
• Varieties of prefetching
  – hardware prefetching
  – software prefetching
  – mixed schemes

• What types of misses does prefetching affect?
Issues in Prefetching
• Usefulness – should produce hits
• Timeliness – not too late and not too early
• Cache and bandwidth pollution

[Figure: CPU/RF with L1 instruction and L1 data caches backed by a unified L2;
prefetched data is brought into the caches ahead of demand.]
Hardware Instruction Prefetching
Instruction prefetch in the Alpha AXP 21064:
– Fetch two blocks on a miss: the requested block (i) and the next
  consecutive block (i+1)
– The requested block is placed in the cache, and the next block in the
  instruction stream buffer
– On a miss in the cache that hits in the stream buffer, move the stream
  buffer block into the cache and prefetch the next block (i+2)

[Figure: CPU/RF fetching from the L1 instruction cache backed by a unified L2;
the requested block goes into L1 while the prefetched block (i+1) is held in
the stream buffer.]
Hardware Data Prefetching
• Prefetch-on-miss:
  – prefetch block b+1 upon a miss on block b

• One-Block Lookahead (OBL) scheme
  – initiate a prefetch for block b+1 when block b is accessed
  – why is this different from doubling the block size?
  – can extend to N-block lookahead

• Strided prefetch
  – if a sequence of accesses to blocks b, b+N, b+2N is observed, then
    prefetch b+3N, etc.

Example: the IBM POWER5 [2003] supports eight independent streams of strided
prefetch per processor, prefetching 12 lines ahead of the current access.
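
A small C sketch of a single-stream stride prefetcher (the one-entry, table-less design and the confidence threshold are simplifying assumptions): it watches the access addresses, detects a constant stride, and issues a prefetch one stride ahead.

```c
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

/* One-entry stride detector: remembers the last address and last stride, and
 * prefetches addr + stride once the same stride has been seen twice. */
typedef struct {
    uint64_t last_addr;
    int64_t  last_stride;
    int      confidence;
    bool     valid;
} StridePrefetcher;

static void observe(StridePrefetcher *p, uint64_t addr) {
    if (p->valid) {
        int64_t stride = (int64_t)(addr - p->last_addr);
        if (stride == p->last_stride && stride != 0) {
            if (++p->confidence >= 2)
                printf("prefetch 0x%llx\n", (unsigned long long)(addr + stride));
        } else {
            p->confidence = 0;
        }
        p->last_stride = stride;
    }
    p->last_addr = addr;
    p->valid = true;
}

int main(void) {
    StridePrefetcher p = {0};
    uint64_t base = 0x10000;
    for (int i = 0; i < 6; i++)
        observe(&p, base + (uint64_t)i * 256);   /* accesses to b, b+N, b+2N, ... */
    return 0;
}
```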
Banked Caches
• Partition the address space into multiple banks
  – use portions of the address (low-order or high-order interleaved)

Benefits:
• Higher throughput
Challenges:
• Bank conflicts
• Extra wiring
• Uneven utilization

[Figure: two banks; address 0 / data 0 go to Bank 0, address 1 / data 1 go to
Bank 1.]
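
A tiny C sketch of low-order-interleaved bank selection (the block size and bank count are example parameters): the bank index comes from the address bits just above the block offset, so consecutive blocks land in different banks.

```c
#include <stdio.h>
#include <stdint.h>

#define BLOCK_BYTES 64
#define NUM_BANKS   4

/* Low-order interleaving: bank index = block number mod NUM_BANKS. */
static unsigned bank_of(uint64_t addr) {
    return (unsigned)((addr / BLOCK_BYTES) % NUM_BANKS);
}

int main(void) {
    for (uint64_t a = 0; a < 6 * BLOCK_BYTES; a += BLOCK_BYTES)
        printf("address 0x%03llx -> bank %u\n",
               (unsigned long long)a, bank_of(a));   /* banks 0,1,2,3,0,1 */
    return 0;
}
```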
Compiler Optimizations
• Restructuring code affects the data block access sequence
  – group data accesses together to improve spatial locality
  – re-order data accesses to improve temporal locality
• Prevent data from entering the cache
  – useful for variables that will only be accessed once before being replaced
  – needs a mechanism for software to tell hardware not to cache data
    (“no-allocate” instruction hints or page table bits)
• Kill data that will never be used again
  – streaming data exploits spatial locality but not temporal locality
  – replace into dead cache locations
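
A standard C illustration of the first bullet (loop interchange, not taken from the slides): traversing a row-major array column-by-column touches a new cache block on nearly every access, while the interchanged loop walks memory sequentially and exploits spatial locality.

```c
#include <stdio.h>

#define R 512
#define C 512

static double a[R][C];

/* Column-major traversal of a row-major array: consecutive accesses are
 * C*sizeof(double) bytes apart, so spatial locality is poor. */
static double sum_bad(void) {
    double s = 0.0;
    for (int j = 0; j < C; j++)
        for (int i = 0; i < R; i++)
            s += a[i][j];
    return s;
}

/* After loop interchange: consecutive accesses are adjacent in memory. */
static double sum_good(void) {
    double s = 0.0;
    for (int i = 0; i < R; i++)
        for (int j = 0; j < C; j++)
            s += a[i][j];
    return s;
}

int main(void) {
    for (int i = 0; i < R; i++)
        for (int j = 0; j < C; j++)
            a[i][j] = 1.0;
    printf("%.0f %.0f\n", sum_bad(), sum_good());   /* same result, different locality */
    return 0;
}
```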
Computer Architecture
ELE 475 / COS 475
Slide Deck 10: Address Translation and Protection
Dynamic Address Translation

• Location-independent programs
  – programming and storage management ease
  => need for a base register
• Protection
  – independent programs should not affect each other inadvertently
  => need for a bound register
• Multiprogramming drives the requirement for a resident supervisor (OS) to
  manage context switches between multiple programs

[Figure: physical memory holding the OS plus prog1 and prog2 at different base
addresses.]
Simple Base and Bound Translation

[Figure: the effective address from a Load is added to the Base Register to
form the physical address into the current segment of physical memory; the
effective address is also compared against the Bound Register (segment length)
to detect a bounds violation.]

Base and bound registers are visible/accessible only when the processor is
running in supervisor mode.
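
A minimal C sketch of the translation and the bounds check (register names follow the slide; the fault handling here is just a printout):

```c
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

typedef struct {
    uint32_t base;    /* start of the segment in physical memory */
    uint32_t bound;   /* segment length */
} SegmentRegs;

/* Translate an effective address; returns false on a bounds violation. */
static bool translate(const SegmentRegs *seg, uint32_t ea, uint32_t *pa) {
    if (ea >= seg->bound)
        return false;              /* bounds violation -> raise an exception */
    *pa = seg->base + ea;
    return true;
}

int main(void) {
    SegmentRegs seg = { .base = 0x40000, .bound = 0x10000 };   /* 64 KiB segment */
    uint32_t pa;
    uint32_t tests[] = { 0xFF00, 0x10000 };
    for (int i = 0; i < 2; i++) {
        if (translate(&seg, tests[i], &pa))
            printf("EA 0x%05x -> PA 0x%05x\n", tests[i], pa);
        else
            printf("EA 0x%05x -> bounds violation\n", tests[i]);
    }
    return 0;
}
```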
Separate Areas for Program and Data

[Figure: two base-and-bound register pairs. The effective address from a Load
is checked against the Data Bound Register and added to the Data Base Register
to reach the data segment in main memory; the Program Counter is checked
against the Program Bound Register and added to the Program Base Register to
reach the program segment.]

What is an advantage of this separation?
(Scheme used on all Cray vector supercomputers prior to the X1, 2002.)
Memory Fragmentation

[Figure: as users arrive and leave, physical memory alternates between
allocated regions (user 1: 16K, user 2: 24K, user 3: 32K, user 4: 16K,
user 5: 24K) and free holes of various sizes.]

As users come and go, the storage is “fragmented”. Therefore, at some stage
programs have to be moved around to compact the storage.
Paged Memory Systems
• A processor-generated address can be interpreted as a pair
  <page number, offset>
• A page table contains the physical address of the base of each page

[Figure: the address space of User-1 (pages 0–3) is mapped through User-1’s
page table to physical memory; the pages need not be contiguous or in order.]

Page tables make it possible to store the pages of a program non-contiguously.
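
A small C sketch of this translation (4 KiB pages and a single-level table are assumptions for the example): the virtual address is split into <page number, offset>, the page number indexes the page table, and the physical frame base is concatenated with the offset.

```c
#include <stdio.h>
#include <stdint.h>

#define PAGE_BITS  12                    /* 4 KiB pages */
#define PAGE_SIZE  (1u << PAGE_BITS)
#define NUM_PAGES  4

/* Page table for one user: physical frame number for each virtual page. */
static uint32_t page_table[NUM_PAGES] = { 7, 3, 12, 5 };

static uint32_t translate(uint32_t va) {
    uint32_t vpn    = va >> PAGE_BITS;          /* page number */
    uint32_t offset = va & (PAGE_SIZE - 1);     /* offset within the page */
    return (page_table[vpn] << PAGE_BITS) | offset;
}

int main(void) {
    uint32_t va = (2u << PAGE_BITS) + 0x34;     /* page 2, offset 0x34 */
    printf("VA 0x%05x -> PA 0x%05x\n", va, translate(va));   /* frame 12 -> 0x0C034 */
    return 0;
}
```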
Private Address Space per User

[Figure: Users 1, 2 and 3 each have their own page table mapping their virtual
address VA1 into physical memory, which also holds the OS pages and free
frames.]

• Each user has their own page table
• The page table contains an entry for each user page
Where Should Page Tables Reside?
• The space required by the page tables (PTs) is proportional to the address
  space, the number of users, (and inversely to) the size of each page, …
  – the space requirement is large
  – too expensive to keep in registers
• Idea: keep the PTs in main memory
  – needs one reference to retrieve the page base address and another to
    access the data word
    • doubles the number of memory references!
  – storage space to store the PTs grows with the size of memory
This slide is not in the Final Exam
