Computer Architecture Revision For Final Exam
- Maaruf Ali
Computer Architecture Slide Deck 5: Superscalar 2 and Exceptions
Causes of Exceptions
Interrupt: an event that requests the attention of the processor
Synchronous Interrupts
• A synchronous interrupt (exception) is caused by a
particular instruction
Exception Handling in a 5-Stage Pipeline
[Figure: the pipeline — PC, Instruction Memory, Decode, Execute (+), Data Memory, Writeback]
Asynchronous Interrupts
[Figure: the 5-stage pipeline annotated with exception sources — PC address exception at fetch, illegal opcode at decode, overflow at execute, data address exception at memory. Exception flags (Exc D, Exc E, Exc M) and PCs (PC D, PC E, PC M) are carried down the pipeline; the commit point latches the exception cause and EPC, selects the handler PC, and kills the F, D, and E stages, while asynchronous interrupts kill writeback.]
Exception Pipeline Diagram

                        t0   t1   t2   t3   t4   t5   t6   t7   t8   t9
(I1) 096: ADD           IF1  ID1  EX1  MA1  nop                        <- overflow!
(I2) 100: XOR                IF2  ID2  EX2  nop  nop
(I3) 104: SUB                     IF3  ID3  nop  nop  nop
(I4) 108: ADD                          IF4  nop  nop  nop  nop
(I5) Exc. Handler code                      IF5  ID5  EX5  MA5  WB5

Resource usage:
                        t0   t1   t2   t3   t4   t5   t6   t7   t8   t9
   IF                   I1   I2   I3   I4   nop  I5
   ID                        I1   I2   I3   nop  nop  I5
   EX                             I1   I2   nop  nop  nop  I5
   MA                                  I1   nop  nop  nop  nop  I5
   WB                                       nop  nop  nop  nop  nop  I5
Out-Of-Order (OOO) Introduction

Name   Frontend  Issue  Writeback  Commit   Key structures
I4     IO        IO     IO         IO       Fixed-length pipelines
I2O2   IO        IO     OOO        OOO      Scoreboard
I2OI   IO        IO     OOO        IO       Scoreboard, Reorder Buffer, and Store Buffer
IO3    IO        OOO    OOO        OOO      Scoreboard and Issue Queue
IO2I   IO        OOO    OOO        IO       Scoreboard, Issue Queue, Reorder Buffer, and Store Buffer

(IO = in order, OOO = out of order)
Register Renaming
• Adding more “names” (registers/memory) removes name dependences (WAR/WAW hazards), but the architectural namespace is limited.
– Registers: a larger namespace requires more bits in the instruction encoding. 32 registers = 5 bits, 128 registers = 7 bits.
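As a sketch of why renaming removes these name dependences, here is a minimal rename-table model in Python (the structure and helper names are illustrative, not from the slides):

```python
# Minimal register-renaming sketch (illustrative).
# Each write to an architectural register is given a fresh physical
# register, so later writers no longer conflict with earlier readers
# (WAR) or earlier writers (WAW).

def rename(instructions, num_arch_regs=32):
    """instructions: list of (dest, src1, src2) architectural register ids.
    Returns the same instructions with physical register ids."""
    # Rename table: architectural register -> current physical register
    table = {r: r for r in range(num_arch_regs)}
    next_phys = num_arch_regs        # fresh physical registers start here
    renamed = []
    for dest, src1, src2 in instructions:
        ps1, ps2 = table[src1], table[src2]   # read sources via the table
        table[dest] = next_phys               # allocate a fresh destination
        renamed.append((next_phys, ps1, ps2))
        next_phys += 1
    return renamed

# Two writes to r1: the WAW/WAR hazards disappear after renaming.
prog = [(1, 2, 3),   # r1 = r2 op r3
        (4, 1, 5),   # r4 = r1 op r5  (reads the first r1)
        (1, 6, 7)]   # r1 = r6 op r7  (WAW with the first, WAR with the second)
print(rename(prog))  # -> [(32, 2, 3), (33, 32, 5), (34, 6, 7)]
```

After renaming, the third instruction writes physical register 34 while the second still reads 32, so they can execute in any order.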
In-Order Memory Queue
• Execute all loads and stores in program order
=> A load or store cannot leave the issue queue (IQ) for execution until all previous loads and stores have completed execution
Address Speculation
st R1, 0(R2)
ld R3, 0(R4)
• Guess that R4 != R2 (the load does not alias the older store) and issue the load early
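The guess can be sketched as a simple post-execution check (illustrative names; real hardware keeps older stores in a load/store queue and replays the load on a mispredicted alias):

```python
# Load-over-store address speculation sketch (illustrative).
# The load issues before older store addresses are known; once they
# resolve, verify the guess and squash/replay the load if it aliased.

def verify_load(older_store_addrs, load_addr):
    """older_store_addrs: resolved addresses of stores older than the load.
    Returns True if the speculative load was correct,
    False if it aliased an older store and must be replayed."""
    return all(addr != load_addr for addr in older_store_addrs)

print(verify_load([0x2000], 0x3000))          # no alias: speculation succeeded
print(verify_load([0x2000, 0x3000], 0x3000))  # aliased: squash and replay
```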
Software Pipelined Loop Performance
[Figure: performance vs. time — a software-pipelined loop pays a one-time startup overhead, then runs at steady-state performance across iterations]
Software pipelining pays startup/wind-down costs only once per loop, not once per iteration.
Trace Scheduling [Fisher, Ellis]
• Pick string of basic blocks, a trace, that
represents most frequent branch path
• Use profiling feedback or compiler
heuristics to find common branch paths
• Schedule whole “trace” at once
• Add fixup code to cope with branches
jumping out of trace
Problems with “Classic” VLIW
• Object-code compatibility
– have to recompile all code for every machine, even for two machines in same
generation
• Object code size
– instruction padding wastes instruction memory/cache
– loop unrolling/software pipelining replicates code
• Scheduling variable latency memory operations
– caches and/or memory bank conflicts impose statically unpredictable
variability
• Knowing branch probabilities
– Profiling requires a significant extra step in the build process
• Scheduling for statically unpredictable branches
– optimal schedule varies with branch path
• Precise Interrupts can be challenging
– Does fault in one portion of bundle fault whole bundle?
– EQ Model has problem with single step, etc.
Code Motion

Before Code Motion        After Code Motion
MUL   R1, R2, R3          LW    R14, 0(R9)
ADDIU R11, R10, 1         ADDIU R11, R10, 1
MUL   R5, R1, R4          MUL   R1, R2, R3
MUL   R7, R5, R6          ADDIU R12, R11, 1
SW    R7, 0(R16)          MUL   R5, R1, R4
ADDIU R12, R11, 1         ADD   R13, R12, R14
LW    R14, 0(R9)          MUL   R7, R5, R6
ADD   R13, R12, R14       ADD   R14, R12, R13
ADD   R14, R12, R13       SW    R7, 0(R16)
BNEQ  R16, target         BNEQ  R16, target
Scheduling and Bundling

Before Bundling           After Bundling
LW    R14, 0(R9)          {LW    R14, 0(R9)
ADDIU R11, R10, 1          ADDIU R11, R10, 1
MUL   R1, R2, R3           MUL   R1, R2, R3}
ADDIU R12, R11, 1         {ADDIU R12, R11, 1
MUL   R5, R1, R4           MUL   R5, R1, R4}
ADD   R13, R12, R14       {ADD   R13, R12, R14
MUL   R7, R5, R6           MUL   R7, R5, R6}
ADD   R14, R12, R13       {ADD   R14, R12, R13
SW    R7, 0(R16)           SW    R7, 0(R16)
BNEQ  R16, target          BNEQ  R16, target}
Prologue

• In computer architecture, the terms “prologue” and “epilogue” refer to specific sections of code within a function.
• The function prologue is a set of instructions that appears at the beginning of a function. Its purpose is to prepare the stack and registers for use within the function.
• Key actions performed by the prologue include:
1. Saving any registers that the function might use (registers that the platform’s calling convention requires to be preserved across function calls).
2. Setting up the stack to allocate space for local variables.
3. Establishing a base pointer (or frame pointer) to track the top of the stack.
• The prologue ensures that the function has a clean slate to work with.
Epilogue

• The function epilogue appears at the end of a function. Its purpose is to restore the stack and registers to the state they were in before the function was called.
• Key actions performed by the epilogue include:
1. Dropping the stack pointer back to the current base pointer, freeing the room reserved for local variables.
2. Popping the base pointer off the stack, restoring it to its value before the prologue.
3. Returning control to the calling function by popping the previous frame’s program counter (the return address) off the stack and jumping to it.
• Essentially, the epilogue cleans up after the function’s execution.
The Need for Prologue and Epilogue
These prologue and epilogue sections are essential for managing the function’s context and ensuring proper execution within the broader program. They are conventions used by assembly language programmers and by compilers of higher-level languages.
Register Rotation
• In computer architecture, register rotation refers to a technique where the bits within a register are shifted circularly: the bit shifted out of one end re-enters at the other, so no data or contents are lost.
• In a shift register, rotation is implemented by connecting the serial output of the shift register back to its serial input.
• Notably, CIL (Circular Shift Left) and CIR (Circular Shift Right) instructions are used for circular shifts left and right, respectively.
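A software sketch of circular shifts (illustrative helper names, not the actual CIL/CIR encodings of any particular machine):

```python
# Rotate the bits of a fixed-width register value; bits shifted out of
# one end re-enter at the other, so no data is lost.

def rotate_left(value, amount, width=8):
    mask = (1 << width) - 1
    amount %= width
    return ((value << amount) | (value >> (width - amount))) & mask

def rotate_right(value, amount, width=8):
    # Rotating right by k is the same as rotating left by (width - k).
    return rotate_left(value, width - (amount % width), width)

# 0b10010110 rotated left by 3 -> 0b10110100 (the top 3 bits wrap around)
print(bin(rotate_left(0b10010110, 3)))   # 0b10110100
print(bin(rotate_right(0b10010110, 3)))  # 0b11010010
```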
Computer Architecture Slide Deck 8: Branch Prediction
Longer Pipeline Frontends Amplify Branch Cost
Where is the Branch Information Known?
[Figure: pipeline stages F D I X M W — annotations mark the stage where the branch outcome is known and the stage where the target address for JR, JALR is known]
Branch Delay Slots
(expose control hazard to software)
[Figure: instructions I4 and I5 occupy the delay slots following the branch]
Static Branch Prediction
Overall probability a branch is taken is ~60-70%, but:
[Figure: a backward branch (e.g., a BEZ closing a loop) is taken ~90% of the time; a forward BEZ is taken ~50% of the time]
Static Hardware Branch Prediction
1. Always Predict Not-Taken
– What we have been assuming
– Simple to implement
– Fall-through PC is known in Fetch
– Poor accuracy, especially on backward branches
2. Always Predict Taken
– Difficult to implement because the target is not known until Decode
– Poor accuracy on if-then-else
3. Backward Branch Taken, Forward Branch Not Taken
– Better accuracy
– Difficult to implement because the target is not known until Decode
Dynamic Hardware Branch Prediction: Exploiting Temporal Correlation
• Exploit structure in the program: the way a branch resolves may be a good indicator of the way it will resolve the next time it executes (temporal correlation)
[Figure: two-state predictor FSM — a “Predict T” state and a “Predict NT” state, with transitions on taken (T) and not-taken (NT) outcomes]
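A common hardware realization of temporal correlation is a saturating counter per branch. Here is a sketch of the 2-bit variant (an assumption going slightly beyond the two-state FSM in the figure; all names are illustrative):

```python
# 2-bit saturating-counter branch predictor (a common realization of
# temporal correlation; sketch, not from the slides).
# Counter states 0-1 predict not-taken, 2-3 predict taken; two
# mispredictions are needed to flip a strongly-held prediction.

class TwoBitPredictor:
    def __init__(self):
        self.state = 1  # start weakly not-taken

    def predict(self):
        return self.state >= 2  # True = predict taken

    def update(self, taken):
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)

p = TwoBitPredictor()
outcomes = [True, True, False, True, True]   # T T NT T T
correct = 0
for t in outcomes:
    correct += (p.predict() == t)
    p.update(t)
print(correct, "of", len(outcomes), "predicted correctly")
```

Note that the single NT in the middle does not flip the prediction: the counter only drops from strongly to weakly taken, so the following T is still predicted correctly.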
Exploiting Spatial Correlation
Yeh and Patt, 1992

if (x[i] < 7) then
    y += 1;
if (x[i] < 5) then
    c -= 4;

The outcome of the first branch constrains the second: whenever x[i] < 5 holds, x[i] < 7 must also have held.
[Figure: two-level predictor — each branch outcome (T/NT) shifts into a Branch History Register (BHR), which indexes a Pattern History Table (PHT) of FSMs; output logic produces the prediction (T/NT)]
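The BHR-indexes-PHT organization can be sketched as follows (table sizes and names are illustrative):

```python
# Two-level predictor sketch (Yeh & Patt style): a global Branch History
# Register (BHR) of recent outcomes indexes a Pattern History Table (PHT)
# of 2-bit saturating counters. Sizes here are illustrative.

class TwoLevelPredictor:
    def __init__(self, history_bits=4):
        self.history_bits = history_bits
        self.bhr = 0                              # recent outcomes, 1 = taken
        self.pht = [1] * (1 << history_bits)      # 2-bit counters, weakly NT

    def predict(self):
        return self.pht[self.bhr] >= 2            # True = predict taken

    def update(self, taken):
        ctr = self.pht[self.bhr]
        self.pht[self.bhr] = min(3, ctr + 1) if taken else max(0, ctr - 1)
        mask = (1 << self.history_bits) - 1
        self.bhr = ((self.bhr << 1) | int(taken)) & mask

# A strictly alternating branch (T NT T NT ...) would fool a single
# counter, but becomes perfectly predictable once the PHT entries for
# the two alternating history patterns have trained.
p = TwoLevelPredictor()
hits = 0
for i in range(100):
    taken = (i % 2 == 0)
    hits += (p.predict() == taken)
    p.update(taken)
print(hits, "of 100 predicted correctly")
```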
Computer Architecture Slide Deck 9: Advanced Caches

Categorizing Misses: The Three C’s
• Compulsory, Capacity, and Conflict misses
Presence of L2 influences L1 design
• Use smaller L1 if there is also L2
– Trade increased L1 miss rate for reduced L1 hit time and
reduced L1 miss penalty
– Reduces average access energy
• Use simpler write-through L1 with on-chip L2
– Write-back L2 cache absorbs write traffic, doesn’t go off-chip
– At most one L1 miss request per L1 access (no dirty victim write
back) simplifies pipeline control
– Simplifies coherence issues
– Simplifies error recovery in L1 (can use just parity bits in L1 and
reload from L2 when parity error detected on L1 read)
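The smaller-L1 trade-off can be quantified with the standard average memory access time (AMAT) formula; the latencies and miss rates below are illustrative, not from the slides:

```python
# AMAT = L1_hit_time + L1_miss_rate * (L2_hit_time + L2_miss_rate * mem_time)
# Illustrative numbers: a smaller, faster L1 can win overall even though
# its miss rate is higher, because the L2 keeps the L1 miss penalty low.

def amat(l1_hit, l1_miss_rate, l2_hit, l2_miss_rate, mem_time):
    return l1_hit + l1_miss_rate * (l2_hit + l2_miss_rate * mem_time)

big_l1   = amat(l1_hit=2, l1_miss_rate=0.04, l2_hit=10, l2_miss_rate=0.2, mem_time=100)
small_l1 = amat(l1_hit=1, l1_miss_rate=0.06, l2_hit=10, l2_miss_rate=0.2, mem_time=100)
print(round(big_l1, 2), "cycles vs", round(small_l1, 2), "cycles")
```

Here the smaller L1 trades a 2-point-higher miss rate for a 1-cycle-faster hit time and still comes out ahead, because each L1 miss costs only the (cheap) L2 hit time most of the time.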
Victim Cache
• Small fully associative cache for recently evicted lines
– Usually small (4-16 blocks)
• Reduces conflict misses
– More associativity for a small number of lines
• Can be checked in parallel or in series with the main cache
• On miss in L1, hit in VC: swap the lines (VC -> L1, L1 -> VC)
• On miss in L1, miss in VC: L1 -> VC, VC -> ? (can always be clean)
[Figure: CPU with register file and L1 data cache; lines evicted from L1 enter a small fully associative victim cache, which supplies hit data on an L1 miss; misses in both go to the unified L2 cache]
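A minimal model of the swap behavior (illustrative; FIFO replacement assumed for the victim cache itself):

```python
# Victim cache sketch (illustrative): a small fully associative buffer
# holds lines recently evicted from L1. An L1 miss that hits here
# recovers what would otherwise have been a conflict miss.

from collections import OrderedDict

class VictimCache:
    def __init__(self, capacity=4):
        self.capacity = capacity
        self.lines = OrderedDict()   # tag -> data, oldest first

    def insert(self, tag, data):
        """Accept a line evicted from L1, evicting our oldest line if full."""
        if len(self.lines) >= self.capacity:
            self.lines.popitem(last=False)
        self.lines[tag] = data

    def lookup(self, tag):
        """On an L1 miss: return (hit, data). A hit removes the line,
        since it moves back into L1 (swapping with the L1 victim)."""
        if tag in self.lines:
            return True, self.lines.pop(tag)
        return False, None

vc = VictimCache()
vc.insert(0x10, "line A")       # L1 evicts line A on a conflict
hit, data = vc.lookup(0x10)     # later L1 miss on the same tag
print(hit, data)                # True line A
```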
Prefetching
• Speculate on future instruction and data accesses
and fetch them into cache(s)
– Instruction accesses easier to predict than data
accesses
• Varieties of prefetching
– Hardware prefetching
– Software prefetching
– Mixed schemes
Hardware Instruction Prefetching
Instruction prefetch in Alpha AXP 21064
– Fetch two blocks on a miss; the requested block (i) and
the next consecutive block (i+1)
– Requested block placed in cache, and next block in
instruction stream buffer
– If miss in cache but hit in stream buffer, move stream
buffer block into cache and prefetch next block (i+2)
[Figure: the CPU fetches the requested block from the L1 instruction cache; on a miss, the requested block comes from the unified L2 into L1 while the next block is prefetched into the stream buffer]
Hardware Data Prefetching
• Prefetch-on-miss:
– Prefetch block b + 1 upon a miss on block b
• Strided prefetch:
– If a sequence of accesses to blocks b, b+N, b+2N is observed, then prefetch b+3N, etc.
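A sketch of the strided scheme (illustrative; a confirmation threshold of two matching strides is assumed before prefetching):

```python
# Strided prefetcher sketch (illustrative): after seeing accesses to
# blocks b, b+N, b+2N, predict the pattern and prefetch b+3N.

class StridePrefetcher:
    def __init__(self):
        self.last_addr = None
        self.stride = None
        self.confirmed = 0   # how many times the current stride repeated

    def access(self, block):
        """Record an access; return the block to prefetch, or None."""
        prefetch = None
        if self.last_addr is not None:
            stride = block - self.last_addr
            if stride == self.stride:
                self.confirmed += 1
                if self.confirmed >= 2:          # b, b+N, b+2N now seen
                    prefetch = block + stride    # prefetch b+3N
            else:
                self.stride = stride             # new candidate stride
                self.confirmed = 1
        self.last_addr = block
        return prefetch

p = StridePrefetcher()
print([p.access(b) for b in [100, 104, 108, 112]])  # [None, None, 112, 116]
```

After the third access the stride of 4 has repeated twice, so block 112 is prefetched just before it is actually needed.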
Banked Caches
• Partition the address space into multiple banks
– Use portions of the address (low- or high-order interleaved)
Benefits:
• Higher throughput
Challenges:
• Bank conflicts
• Extra wiring
• Uneven utilization
[Figure: address and data paths routed to Bank 0, Bank 1, ...]
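Low-order interleaving can be sketched as follows (block size and bank count are illustrative):

```python
# Bank selection sketch (illustrative): use low-order block-address bits
# to interleave consecutive blocks across banks.

def bank_of(addr, block_bytes=64, num_banks=4):
    block = addr // block_bytes
    return block % num_banks        # low-order interleaving

# Consecutive blocks map to different banks (good for streaming) ...
print([bank_of(a) for a in (0, 64, 128, 192, 256)])   # [0, 1, 2, 3, 0]
# ... but a stride of num_banks * block_bytes hits one bank repeatedly,
# producing bank conflicts and uneven utilization.
print([bank_of(a) for a in (0, 256, 512)])            # [0, 0, 0]
```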
Compiler
Optimizations
• Restructuring code affects the data block access
sequence
– Group data accesses together to improve spatial locality
– Re-order data accesses to improve temporal locality
• Prevent data from entering the cache
– Useful for variables that will only be accessed once before being replaced
– Needs mechanism for software to tell hardware not to cache data
(“no-allocate” instruction hints or page table bits)
• Kill data that will never be used again
– Streaming data exploits spatial locality but not temporal locality
– Replace into dead cache locations
Computer Architecture ELE 475 / COS 475
Slide Deck 10: Address Translation and Protection
Dynamic Address Translation

Location-independent programs
• Programming and storage management ease
=> need for a base register
Protection
• Independent programs should not affect each other inadvertently
=> need for a bound register
• Multiprogramming drives the requirement for a resident supervisor to manage context switches between multiple programs
[Figure: physical memory holding the OS, prog1, and prog2]
Simple Base and Bound Translation
[Figure: the effective address is compared against the bound register (segment length) to detect a bounds violation, and added to the base register to form the physical address of the current segment in main memory; separate base registers are shown for the program address space and for data]
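The bound check and base add can be sketched as follows (register values are illustrative):

```python
# Base-and-bound translation sketch (illustrative): the effective
# address is checked against the bound register (segment length) and
# added to the base register to form the physical address.

def translate(effective_addr, base, bound):
    if effective_addr >= bound:          # bounds violation check
        raise MemoryError("bounds violation")
    return base + effective_addr         # physical address

print(hex(translate(0x100, base=0x4000, bound=0x1000)))  # 0x4100
try:
    translate(0x2000, base=0x4000, bound=0x1000)         # outside segment
except MemoryError as e:
    print(e)                                             # bounds violation
```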
Private Address Space per User
[Figure: each user (User 1, User 2, User 3) translates its own virtual address VA1 through a per-user page table to pages in physical memory; the OS also resides in physical memory]
Where Should Page Tables Reside?
• Space required by the page tables (PT) is proportional to the address space, the number of users, and (inversely) the size of each page, ...
– Space requirement is large
– Too expensive to keep in registers
• Idea: keep PTs in main memory
– Needs one memory reference to retrieve the page base address and another to access the data word
• Doubles the number of memory references!
– Storage space for the PT grows with the size of memory
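The doubled reference count can be made concrete with a toy flat page table held in "memory" (everything here is illustrative):

```python
# Sketch (illustrative): with the page table in main memory, each user
# reference costs two memory accesses -- one to fetch the page-table
# entry (PTE), one to fetch the data word itself.

PAGE_SIZE = 4096

def load(memory, page_table_base, vaddr, mem_refs):
    vpn, offset = divmod(vaddr, PAGE_SIZE)
    mem_refs[0] += 1                           # reference 1: fetch the PTE
    frame = memory[page_table_base + vpn]
    mem_refs[0] += 1                           # reference 2: fetch the data
    return memory[frame * PAGE_SIZE + offset]

memory = {0: 2, 1: 3}                # PT at address 0: vpn 0 -> frame 2, vpn 1 -> frame 3
memory[2 * PAGE_SIZE + 8] = "hello"  # data word in frame 2
refs = [0]
print(load(memory, 0, 8, refs), "with", refs[0], "memory references")
```

This is exactly the cost a TLB exists to avoid: caching the PTE turns the common case back into a single memory reference.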
This slide is not in the Final Exam