Question: Who Cares About The Memory Hierarchy?: Caches and Memory Systems I

The memory hierarchy is important because the gap between CPU and memory speeds continues to grow significantly each year. Caches attempt to bridge this gap by providing faster but smaller memory closer to the CPU. Caches exploit locality by storing recently or frequently accessed data from slower further memory. Common cache organizations include direct mapped, set associative, and fully associative caches which balance access time, complexity, and flexibility.


Question: Who Cares About the Memory Hierarchy?

Caches and Memory Systems I

[Figure: processor vs. DRAM performance, 1980-2000, on a log scale from 1 to 1000. µProc performance grows ~60%/year ("Moore's Law") while DRAM performance grows ~7%/year, so the processor-memory (CPU-DRAM) performance gap grows ~50%/year ("Less' Law?").]
• 1980: no cache in µproc; 1995: 2-level cache on chip (1989: first Intel µproc with a cache on chip)

What is a cache?

• Small, fast storage used to improve average access time to slow memory.
• Exploits spatial and temporal locality.
• In computer architecture, almost everything is a cache!
  – Registers: a cache on variables
  – First-level cache: a cache on second-level cache
  – Second-level cache: a cache on memory
  – Memory: a cache on disk (virtual memory)
  – TLB: a cache on page table
  – Branch-prediction: a cache on prediction information?

[Figure: memory hierarchy — Proc/Regs, L1-Cache, L2-Cache, Memory, Disk/Tape, etc.; levels get bigger going down and faster going up.]

Example: 1 KB Direct Mapped Cache

• For a 2**N byte cache:
  – The uppermost (32 - N) bits are always the Cache Tag
  – The lowest M bits are the Byte Select (Block Size = 2**M)
• Block address example (1 KB cache, 32-byte blocks): bits 31-10 are the Cache Tag (Ex: 0x50), bits 9-5 are the Cache Index (Ex: 0x01), bits 4-0 are the Byte Select (Ex: 0x00).
• The Cache Tag is stored as part of the cache "state", together with a Valid Bit.

[Figure: cache array with Valid Bit, Cache Tag, and Cache Data columns; row 0 holds Bytes 0-31, row 1 holds Bytes 32-63 (tag 0x50 in the example), ..., row 31 holds Bytes 992-1023.]
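The following sketch (not from the slides; the address 0x00014020 is a made-up example) shows how the tag, index, and byte-select fields of the 1 KB direct mapped cache above can be extracted in C. For this address it reproduces the example field values 0x50, 0x01, and 0x00.

/*
 * Minimal sketch: decompose a 32-bit address for a 1 KB direct-mapped cache
 * (2**N = 1024 bytes, 2**M = 32-byte blocks, so N = 10 and M = 5).
 */
#include <stdint.h>
#include <stdio.h>

#define BLOCK_BITS 5                          /* M: 32-byte blocks      */
#define CACHE_BITS 10                         /* N: 1 KB cache          */
#define INDEX_BITS (CACHE_BITS - BLOCK_BITS)  /* 5 bits -> 32 blocks    */

int main(void) {
    uint32_t addr = 0x00014020;               /* arbitrary example address */

    uint32_t byte_select = addr & ((1u << BLOCK_BITS) - 1);
    uint32_t index = (addr >> BLOCK_BITS) & ((1u << INDEX_BITS) - 1);
    uint32_t tag = addr >> CACHE_BITS;        /* uppermost 32 - N bits  */

    printf("tag=0x%x index=0x%x byte=0x%x\n", tag, index, byte_select);
    return 0;
}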
Set Associative Cache

• N-way set associative: N entries for each Cache Index
  – N direct mapped caches operate in parallel
• Example: Two-way set associative cache
  – Cache Index selects a "set" from the cache
  – The two tags in the set are compared to the input in parallel
  – Data is selected based on the tag result

[Figure: two-way set associative lookup — the Cache Index selects one block from each way (Valid, Cache Tag, Cache Data); the Address Tag is compared against both tags in parallel, the compare results are OR'd to produce Hit, and Sel0/Sel1 drive a mux that selects the matching Cache Block.]

Disadvantage of Set Associative Cache

• N-way Set Associative Cache versus Direct Mapped Cache:
  – N comparators vs. 1
  – Extra MUX delay for the data
  – Data comes AFTER Hit/Miss decision and set selection
• In a direct mapped cache, the Cache Block is available BEFORE Hit/Miss:
  – Possible to assume a hit and continue. Recover later if miss.
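A minimal C sketch of the two-way lookup path described above, using hypothetical structures (way, set, lookup) rather than any real hardware interface: both tags of the selected set are compared, the results are OR'd into the hit signal, and the matching way supplies the data.

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_SETS   16
#define BLOCK_SIZE 32

struct way { bool valid; uint32_t tag; uint8_t data[BLOCK_SIZE]; };
struct set { struct way way[2]; };

static struct set cache[NUM_SETS];

/* Returns a pointer to the matching block's data, or NULL on a miss. */
static uint8_t *lookup(uint32_t addr) {
    uint32_t index = (addr / BLOCK_SIZE) % NUM_SETS;  /* Cache Index selects a set */
    uint32_t tag   =  addr / (BLOCK_SIZE * NUM_SETS);
    struct set *s  = &cache[index];

    /* Hardware compares both tags in parallel; Hit is the OR of the results. */
    bool hit0 = s->way[0].valid && s->way[0].tag == tag;
    bool hit1 = s->way[1].valid && s->way[1].tag == tag;

    if (hit0) return s->way[0].data;   /* Sel0 drives the data mux */
    if (hit1) return s->way[1].data;   /* Sel1 drives the data mux */
    return NULL;                       /* miss */
}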

Review: Cache Performance

• Miss-oriented approach to memory access:

  CPUtime = IC × (CPI_Execution + MemAccess/Inst × MissRate × MissPenalty) × CycleTime
  CPUtime = IC × (CPI_Execution + MemMisses/Inst × MissPenalty) × CycleTime

  – CPI_Execution includes ALU and Memory instructions
• Separating out the Memory component entirely:
  – AMAT = Average Memory Access Time
  – CPI_AluOps does not include memory instructions

  CPUtime = IC × (AluOps/Inst × CPI_AluOps + MemAccess/Inst × AMAT) × CycleTime

  AMAT = HitTime + MissRate × MissPenalty
       = (HitTime_Inst + MissRate_Inst × MissPenalty_Inst)
       + (HitTime_Data + MissRate_Data × MissPenalty_Data)

Impact on Performance

• Suppose a processor executes at
  – Clock Rate = 200 MHz (5 ns per cycle), Ideal (no misses) CPI = 1.1
  – 50% arith/logic, 30% ld/st, 20% control
• Suppose that 10% of memory operations get a 50 cycle miss penalty
• Suppose that 1% of instructions get the same miss penalty
• CPI = ideal CPI + average stalls per instruction
      = 1.1 (cycles/ins)
      + [0.30 (DataMops/ins) × 0.10 (miss/DataMop) × 50 (cycles/miss)]
      + [1 (InstMop/ins) × 0.01 (miss/InstMop) × 50 (cycles/miss)]
      = (1.1 + 1.5 + 0.5) cycles/ins = 3.1
• 58% of the time the processor is stalled waiting for memory!
• AMAT = (1/1.3)×[1 + 0.01×50] + (0.3/1.3)×[1 + 0.1×50] = 2.54
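A small, self-contained check (not part of the lecture) of the arithmetic above, computing the CPI with memory stalls and the AMAT for the stated parameters.

#include <stdio.h>

int main(void) {
    double ideal_cpi    = 1.1;
    double ldst_frac    = 0.30;   /* data memory ops per instruction */
    double data_miss    = 0.10;   /* misses per data memory op       */
    double inst_miss    = 0.01;   /* misses per instruction fetch    */
    double miss_penalty = 50.0;   /* cycles                          */

    double data_stalls = ldst_frac * data_miss * miss_penalty;  /* 1.5 */
    double inst_stalls = 1.0 * inst_miss * miss_penalty;        /* 0.5 */
    double cpi = ideal_cpi + data_stalls + inst_stalls;         /* 3.1 */

    /* 1.3 memory accesses per instruction: 1 fetch + 0.3 data ops. */
    double amat = (1.0 / 1.3) * (1.0 + inst_miss * miss_penalty)
                + (0.3 / 1.3) * (1.0 + data_miss * miss_penalty);

    printf("CPI = %.1f, AMAT = %.2f cycles\n", cpi, amat);
    return 0;
}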
Review: Four Questions for Memory Hierarchy Designers

• Q1: Where can a block be placed in the upper level? (Block placement)
  – Fully Associative, Set Associative, Direct Mapped
• Q2: How is a block found if it is in the upper level? (Block identification)
  – Tag/Block
• Q3: Which block should be replaced on a miss? (Block replacement)
  – Random, LRU
• Q4: What happens on a write? (Write strategy)
  – Write Back or Write Through (with Write Buffer)

Example: Harvard Architecture

• Unified vs Separate I&D (Harvard)

[Figure: split organization — Proc with I-Cache-1 and D-Cache-1 backed by Unified Cache-2 — versus unified organization — Proc with Unified Cache-1 backed by Unified Cache-2.]

• Table on page 384:
  – 16KB I&D: Inst miss rate = 0.64%, Data miss rate = 6.47%
  – 32KB unified: Aggregate miss rate = 1.99%
• Which is better (ignore L2 cache)?
  – Assume 33% data ops ⇒ 75% of accesses are instruction fetches (1.0/1.33)
  – hit time = 1, miss time = 50
  – Note that a data hit has 1 extra stall for the unified cache (only one port)

  AMAT_Harvard = 75% × (1 + 0.64% × 50) + 25% × (1 + 6.47% × 50) = 2.05
  AMAT_Unified = 75% × (1 + 1.99% × 50) + 25% × (1 + 1 + 1.99% × 50) = 2.24
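A quick check (again, not from the slides) of the split vs. unified comparison, using the miss rates from the table and the one-cycle port stall for the unified cache.

#include <stdio.h>

int main(void) {
    double inst_frac = 0.75, data_frac = 0.25, miss_time = 50.0;

    double amat_harvard = inst_frac * (1.0 + 0.0064 * miss_time)
                        + data_frac * (1.0 + 0.0647 * miss_time);
    /* Unified cache: data hits pay one extra stall for the single port. */
    double amat_unified = inst_frac * (1.0 + 0.0199 * miss_time)
                        + data_frac * (2.0 + 0.0199 * miss_time);

    printf("AMAT Harvard = %.2f, AMAT Unified = %.2f\n",
           amat_harvard, amat_unified);   /* ~2.05 vs ~2.24 */
    return 0;
}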

Review: Improving Cache Performance

1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache.

Reducing Misses

• Classifying Misses: 3 Cs
  – Compulsory—The first access to a block is not in the cache, so the block must be brought into the cache. Also called cold start misses or first reference misses. (Misses in even an Infinite Cache)
  – Capacity—If the cache cannot contain all the blocks needed during execution of a program, capacity misses will occur due to blocks being discarded and later retrieved. (Misses in Fully Associative Size X Cache)
  – Conflict—If the block-placement strategy is set associative or direct mapped, conflict misses (in addition to compulsory & capacity misses) will occur because a block can be discarded and later retrieved if too many blocks map to its set. Also called collision misses or interference misses. (Misses in N-way Associative, Size X Cache)
3Cs Absolute Miss Rate (SPEC92)

[Figure: absolute miss rate (0 to 0.14) vs. cache size (1 KB to 128 KB) for 1-way, 2-way, 4-way, and 8-way associativity, with the miss rate broken into Conflict, Capacity, and Compulsory components; the Compulsory component is vanishingly small.]

2:1 Cache Rule

  miss rate of a 1-way associative cache of size X
  ≈ miss rate of a 2-way associative cache of size X/2

[Figure: the same miss-rate-vs-cache-size breakdown, illustrating the 2:1 rule.]

3Cs Relative Miss Rate

[Figure: relative miss rate (0% to 100%) vs. cache size (1 KB to 128 KB) for 1-way through 8-way associativity, showing the Conflict, Capacity, and Compulsory fractions for a fixed block size.]

• Flaws: for fixed block size
• Good: insight => invention

How Can We Reduce Misses?

• 3 Cs: Compulsory, Capacity, Conflict
• In all cases, assume the total cache size is not changed.
• What happens if we:
  1) Change Block Size: which of the 3Cs is obviously affected?
  2) Change Associativity: which of the 3Cs is obviously affected?
  3) Change Compiler: which of the 3Cs is obviously affected?
1. Reduce Misses via Larger Block Size

[Figure: miss rate (0% to 25%) vs. block size (16 to 256 bytes) for cache sizes of 1K, 4K, 16K, 64K, and 256K; larger blocks reduce the miss rate up to a point, then increase it again for small caches.]

2. Reduce Misses via Higher Associativity

• 2:1 Cache Rule:
  – Miss rate of a direct-mapped cache of size N ≈ miss rate of a 2-way cache of size N/2
• Beware: Execution time is the only final measure!
  – Will the clock cycle time increase?
  – Hill [1988] suggested the hit time for 2-way vs. 1-way: external cache +10%, internal +2%

Example: Avg. Memory Access Time vs. Miss Rate

• Example: assume CCT = 1.10 for 2-way, 1.12 for 4-way, 1.14 for 8-way vs. CCT of direct mapped

  Cache Size (KB)   1-way   2-way   4-way   8-way
        1           2.33    2.15    2.07    2.01
        2           1.98    1.86    1.76    1.68
        4           1.72    1.67    1.61    1.53
        8           1.46    1.48    1.47    1.43
       16           1.29    1.32    1.32    1.32
       32           1.20    1.24    1.25    1.27
       64           1.14    1.20    1.21    1.23
      128           1.10    1.17    1.18    1.20

  (Red in the original means A.M.A.T. not improved by more associativity)

3. Reducing Misses via a "Victim Cache"

• How to combine the fast hit time of direct mapped yet still avoid conflict misses?
• Add a buffer to hold data discarded from the cache
• Jouppi [1990]: a 4-entry victim cache removed 20% to 95% of conflicts for a 4 KB direct mapped data cache
• Used in Alpha, HP machines

[Figure: direct-mapped cache (TAGS/DATA) with a small fully associative victim cache — a few entries of "Tag and Comparator / One Cache line of Data" — between it and the next lower level in the hierarchy.]
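A minimal software sketch of the victim-cache idea, with assumed sizes (a 128-line direct mapped cache, 4 victim entries, FIFO replacement) and hypothetical structure names; a real victim cache is a hardware buffer, so this only illustrates the lookup-and-swap behaviour.

#include <stdbool.h>
#include <stdint.h>

#define LINES   128   /* lines in the direct-mapped cache (assumed size) */
#define VICTIMS 4     /* Jouppi-style 4-entry victim cache               */

struct dm_line { bool valid; uint32_t tag; };
struct victim  { bool valid; uint32_t block_addr; };

static struct dm_line dm_cache[LINES];
static struct victim  victims[VICTIMS];
static int victim_next;                 /* simple FIFO replacement */

/* Returns true on a hit (either in the main cache or the victim cache). */
static bool access_block(uint32_t block_addr) {
    uint32_t index = block_addr % LINES;
    uint32_t tag   = block_addr / LINES;
    struct dm_line *l = &dm_cache[index];

    if (l->valid && l->tag == tag)
        return true;                                /* ordinary hit */

    /* Full block address of whatever currently occupies this line. */
    bool     evicted_valid = l->valid;
    uint32_t evicted_addr  = l->tag * LINES + index;

    for (int i = 0; i < VICTIMS; i++) {
        if (victims[i].valid && victims[i].block_addr == block_addr) {
            /* Victim hit: swap the conflicting lines. */
            victims[i].valid      = evicted_valid;
            victims[i].block_addr = evicted_addr;
            l->valid = true;
            l->tag   = tag;
            return true;
        }
    }

    /* Real miss: the displaced line goes to the victim buffer. */
    if (evicted_valid) {
        victims[victim_next] = (struct victim){ true, evicted_addr };
        victim_next = (victim_next + 1) % VICTIMS;
    }
    l->valid = true;
    l->tag   = tag;
    return false;
}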
4. Reducing Misses via "Pseudo-Associativity"

• How to combine the fast hit time of Direct Mapped with the lower conflict misses of a 2-way SA cache?
• Divide the cache: on a miss, check the other half of the cache to see if the block is there; if so, it is a pseudo-hit (slow hit)

  Hit Time | Pseudo Hit Time | Miss Penalty  --->  Time

• Drawback: CPU pipeline design is hard if a hit takes 1 or 2 cycles
  – Better for caches not tied directly to the processor (L2)
  – Used in MIPS R10000 L2 cache, similar in UltraSPARC

5. Reducing Misses by Hardware Prefetching of Instructions & Data

• E.g., Instruction Prefetching
  – Alpha 21064 fetches 2 blocks on a miss
  – Extra block placed in a "stream buffer"
  – On a miss, check the stream buffer
• Works with data blocks too:
  – Jouppi [1990]: 1 data stream buffer got 25% of misses from a 4KB cache; 4 streams got 43%
  – Palacharla & Kessler [1994]: for scientific programs, 8 streams got 50% to 70% of misses from two 64KB, 4-way set associative caches
• Prefetching relies on having extra memory bandwidth that can be used without penalty

6. Reducing Misses by Software Prefetching Data

• Data Prefetch
  – Load data into register (HP PA-RISC loads)
  – Cache Prefetch: load into cache (MIPS IV, PowerPC, SPARC v.9)
  – Special prefetching instructions cannot cause faults; a form of speculative execution
• Prefetching comes in two flavors:
  – Binding prefetch: requests load directly into a register.
    » Must be correct address and register!
  – Non-Binding prefetch: load into cache.
    » Can be incorrect. Frees HW/SW to guess!
• Issuing prefetch instructions takes time
  – Is the cost of prefetch issues < savings in reduced misses?
  – Higher superscalar width reduces the difficulty of issue bandwidth
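As an illustration of a non-binding (cache) prefetch in source code, the sketch below uses the GCC/Clang intrinsic __builtin_prefetch; the prefetch distance of 16 elements is an arbitrary, untuned assumption.

/*
 * Non-binding software prefetch: the hint cannot fault and may be ignored.
 * Arguments: address, rw (0 = read), temporal locality (0..3).
 */
void scale(double *a, const double *b, long n) {
    for (long i = 0; i < n; i++) {
        if (i + 16 < n)
            __builtin_prefetch(&b[i + 16], 0, 1);  /* fetch ahead of use */
        a[i] = 2.0 * b[i];
    }
}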


7. Reducing Misses by Compiler Optimizations

• McFarling [1989] reduced cache misses by 75% on an 8KB direct mapped cache with 4 byte blocks, in software
• Instructions
  – Reorder procedures in memory so as to reduce conflict misses
  – Profiling to look at conflicts (using tools they developed)
• Data
  – Merging Arrays: improve spatial locality by a single array of compound elements vs. 2 arrays
  – Loop Interchange: change the nesting of loops to access data in the order stored in memory
  – Loop Fusion: combine 2 independent loops that have the same looping and some variables overlap
  – Blocking: improve temporal locality by accessing "blocks" of data repeatedly vs. going down whole columns or rows
Merging Arrays Example

/* Before: 2 sequential arrays */
int val[SIZE];
int key[SIZE];

/* After: 1 array of structures */
struct merge {
  int val;
  int key;
};
struct merge merged_array[SIZE];

Reducing conflicts between val & key; improve spatial locality.

Loop Interchange Example

/* Before */
for (k = 0; k < 100; k = k+1)
  for (j = 0; j < 100; j = j+1)
    for (i = 0; i < 5000; i = i+1)
      x[i][j] = 2 * x[i][j];

/* After */
for (k = 0; k < 100; k = k+1)
  for (i = 0; i < 5000; i = i+1)
    for (j = 0; j < 100; j = j+1)
      x[i][j] = 2 * x[i][j];

Sequential accesses instead of striding through memory every 100 words; improved spatial locality.

Loop Fusion Example

/* Before */
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1)
    a[i][j] = 1/b[i][j] * c[i][j];
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1)
    d[i][j] = a[i][j] + c[i][j];

/* After */
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1)
  { a[i][j] = 1/b[i][j] * c[i][j];
    d[i][j] = a[i][j] + c[i][j]; }

2 misses per access to a & c vs. one miss per access; improve spatial locality.

Blocking Example

/* Before */
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1)
  { r = 0;
    for (k = 0; k < N; k = k+1) {
      r = r + y[i][k]*z[k][j]; };
    x[i][j] = r;
  };

• Two Inner Loops:
  – Read all NxN elements of z[]
  – Read N elements of 1 row of y[] repeatedly
  – Write N elements of 1 row of x[]
• Capacity misses are a function of N & Cache Size:
  – 2N^3 + N^2 words accessed (assuming no conflict; otherwise ...)
• Idea: compute on a BxB submatrix that fits in the cache
/* After */
for (jj = 0; jj < N; jj = jj+B)
  for (kk = 0; kk < N; kk = kk+B)
    for (i = 0; i < N; i = i+1)
      for (j = jj; j < min(jj+B-1,N); j = j+1)
      { r = 0;
        for (k = kk; k < min(kk+B-1,N); k = k+1) {
          r = r + y[i][k]*z[k][j]; };
        x[i][j] = x[i][j] + r;
      };

• B is called the Blocking Factor
• Capacity misses drop from 2N^3 + N^2 to N^3/B + 2N^2
• Conflict misses too?

Summary of Compiler Optimizations to Reduce Cache Misses (by hand)

[Figure: performance improvement (1x to 3x) from merged arrays, loop interchange, loop fusion, and blocking on vpenta (nasa7), gmty (nasa7), tomcatv, btrix (nasa7), mxm (nasa7), spice, cholesky (nasa7), and compress.]

Review: Improving Cache Performance

1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache.

Summary: Miss Rate Reduction

  CPUtime = IC × (CPI_Execution + Memory accesses/Instruction × Miss rate × Miss penalty) × Clock cycle time

• 3 Cs: Compulsory, Capacity, Conflict
  1. Reduce Misses via Larger Block Size
  2. Reduce Misses via Higher Associativity
  3. Reducing Misses via Victim Cache
  4. Reducing Misses via Pseudo-Associativity
  5. Reducing Misses by HW Prefetching Instr, Data
  6. Reducing Misses by SW Prefetching Data
  7. Reducing Misses by Compiler Optimizations
• Prefetching comes in two flavors:
  – Binding prefetch: requests load directly into a register.
    » Must be correct address and register!
  – Non-Binding prefetch: load into cache.
    » Can be incorrect. Frees HW/SW to guess!
Write Policy: Write-Through vs Write-Back

• Write-through: all writes update the cache and the underlying memory/cache
  – Can always discard cached data - the most up-to-date data is in memory
  – Cache control bit: only a valid bit
• Write-back: all writes simply update the cache
  – Can't just discard cached data - may have to write it back to memory
  – Cache control bits: both valid and dirty bits
• Other advantages:
  – Write-through:
    » memory (or other processors) always have the latest data
    » simpler management of the cache
  – Write-back:
    » much lower bandwidth, since data is often overwritten multiple times
    » better tolerance to long-latency memory?
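A minimal sketch (not from the slides) of the per-line state each policy implies: a write-through line needs only a valid bit, while a write-back line also needs a dirty bit so it can be written back before being discarded.

#include <stdbool.h>
#include <stdint.h>

struct wt_line {            /* write-through */
    bool     valid;
    uint32_t tag;
};

struct wb_line {            /* write-back */
    bool     valid;
    bool     dirty;         /* set on every write hit */
    uint32_t tag;
};

/* On eviction, a write-back cache must flush dirty data to memory. */
static inline bool needs_writeback(const struct wb_line *l) {
    return l->valid && l->dirty;
}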

1. Reducing Miss Penalty: Read Priority over Write on Miss

[Figure: CPU with a write buffer between it and DRAM (or lower memory); writes go into the buffer while reads can bypass it.]

• Write-through with write buffers creates RAW conflicts with main memory reads on cache misses
  – If we simply wait for the write buffer to empty, we might increase the read miss penalty (old MIPS 1000: by 50%)
  – Instead, check the write buffer contents before a read; if there are no conflicts, let the memory access continue
• Write-back also wants a buffer to hold displaced blocks
  – Read miss replacing a dirty block
  – Normal: write the dirty block to memory, and then do the read
  – Instead: copy the dirty block to a write buffer, then do the read, and then do the write
  – The CPU stalls less since it restarts as soon as the read is done

2. Reduce Miss Penalty: Early Restart and Critical Word First

• Don't wait for the full block to be loaded before restarting the CPU
  – Early restart—As soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution
  – Critical Word First—Request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while filling the rest of the words in the block. Also called wrapped fetch and requested word first
• Generally useful only with large blocks
• Spatial locality is a problem: the CPU tends to want the next sequential word, so it is not clear there is a benefit from early restart
3. Reduce Miss Penalty: Non-blocking Caches to Reduce Stalls on Misses

• A non-blocking cache (lockup-free cache) allows the data cache to continue to supply cache hits during a miss
  – requires F/E bits on registers or out-of-order execution
  – requires multi-bank memories
• "hit under miss" reduces the effective miss penalty by working during the miss vs. ignoring CPU requests
• "hit under multiple miss" or "miss under miss" may further lower the effective miss penalty by overlapping multiple misses
  – Significantly increases the complexity of the cache controller, as there can be multiple outstanding memory accesses
  – Requires multiple memory banks (otherwise cannot support it)
  – Pentium Pro allows 4 outstanding memory misses

4: Add a Second-Level Cache

• L2 Equations

  AMAT = Hit Time_L1 + Miss Rate_L1 × Miss Penalty_L1
  Miss Penalty_L1 = Hit Time_L2 + Miss Rate_L2 × Miss Penalty_L2
  AMAT = Hit Time_L1 + Miss Rate_L1 × (Hit Time_L2 + Miss Rate_L2 × Miss Penalty_L2)

• Definitions:
  – Local miss rate—misses in this cache divided by the total number of memory accesses to this cache (Miss Rate_L2)
  – Global miss rate—misses in this cache divided by the total number of memory accesses generated by the CPU (Miss Rate_L1 × Miss Rate_L2)
  – Global miss rate is what matters
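A small sketch (illustrative numbers only, not from the lecture) of the L2 equations above: the L1 miss penalty is the AMAT of the L2 access, and the global L2 miss rate is the product of the local miss rates.

#include <stdio.h>

static double amat_two_level(double hit_l1, double miss_rate_l1,
                             double hit_l2, double miss_rate_l2,
                             double miss_penalty_l2) {
    double miss_penalty_l1 = hit_l2 + miss_rate_l2 * miss_penalty_l2;
    return hit_l1 + miss_rate_l1 * miss_penalty_l1;
}

int main(void) {
    /* Assumed example parameters, not measurements from the slides. */
    double miss_rate_l1 = 0.04, miss_rate_l2_local = 0.25;

    printf("AMAT = %.2f cycles\n",
           amat_two_level(1.0, miss_rate_l1, 10.0, miss_rate_l2_local, 100.0));
    printf("global L2 miss rate = %.3f\n",
           miss_rate_l1 * miss_rate_l2_local);
    return 0;
}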

Comparing Local and Global Miss Rates

• 32 KByte 1st level cache; increasing 2nd level cache

[Figure: local and global L2 miss rates vs. L2 cache size, plotted on linear and log scales.]

• The global miss rate is close to the single-level cache miss rate provided L2 >> L1
• Don't use the local miss rate
• L2 is not tied to the CPU clock cycle!
• What matters: cost & A.M.A.T.
• Generally: fast hit times and fewer misses
• Since hits are few, target miss reduction

Reducing Misses: Which Apply to the L2 Cache?

• Reducing Miss Rate
  1. Reduce Misses via Larger Block Size
  2. Reduce Conflict Misses via Higher Associativity
  3. Reducing Conflict Misses via Victim Cache
  4. Reducing Conflict Misses via Pseudo-Associativity
  5. Reducing Misses by HW Prefetching Instr, Data
  6. Reducing Misses by SW Prefetching Data
  7. Reducing Capacity/Conf. Misses by Compiler Optimizations
L2 Cache Block Size & A.M.A.T.

[Figure: relative CPU time vs. L2 block size (16 to 512 bytes) for a 32KB L1 cache with an 8-byte path to memory; relative CPU time ranges from 1.27 at the best block size to 1.95 at the worst.]

Reducing Miss Penalty Summary

  CPUtime = IC × (CPI_Execution + Memory accesses/Instruction × Miss rate × Miss penalty) × Clock cycle time

• Four techniques
  – Read priority over write on miss
  – Early Restart and Critical Word First on miss
  – Non-blocking Caches (Hit under Miss, Miss under Miss)
  – Second Level Cache
• Can be applied recursively to multilevel caches
  – Danger is that the time to DRAM will grow with multiple levels in between
  – First attempts at L2 caches can make things worse, since the increased worst case is worse

What is the Impact of What You've Learned About Caches?

[Figure: processor vs. DRAM performance, 1980-2000, log scale from 1 to 1000 — the same processor-memory gap plot as before.]

• 1960-1985: Speed = ƒ(no. operations)
• 1990:
  – Pipelined Execution & Fast Clock Rate
  – Out-of-Order execution
  – Superscalar Instruction Issue
• 1998: Speed = ƒ(non-cached memory accesses)
• Superscalar, Out-of-Order machines hide an L1 data cache miss (~5 clocks) but not an L2 cache miss (~50 clocks)?

Cache Optimization Summary

  Technique                          MR   MP   HT   Complexity
  -------------------------------------------------------------
  Larger Block Size                  +    –         0
  Higher Associativity               +         –    1
  Victim Caches                      +              2
  Pseudo-Associative Caches          +              2
  HW Prefetching of Instr/Data       +              2
  Compiler Controlled Prefetching    +              3
  Compiler Reduce Misses             +              0
  Priority to Read Misses                 +         1
  Early Restart & Critical Word 1st       +         2
  Non-Blocking Caches                     +         3
  Second Level Caches                     +         2

  (MR = miss rate, MP = miss penalty, HT = hit time; + helps, – hurts)
Improving Cache Performance

1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache.

  AMAT = HitTime + MissRate × MissPenalty

1. Fast Hit Times via Small and Simple Caches

• Why does the Alpha 21164 have 8KB Instruction and 8KB Data caches + a 96KB second level cache?
  – A small data cache keeps the clock rate fast
• Direct Mapped, on chip

2. Fast Hits by Avoiding Address Translation

• Send the virtual address to the cache? Called a Virtually Addressed Cache or just Virtual Cache, vs. a Physical Cache
  – Every time the process is switched, the cache logically must be flushed; otherwise we get false hits
    » Cost is the time to flush + "compulsory" misses from an empty cache
  – Dealing with aliases (sometimes called synonyms): two different virtual addresses map to the same physical address
  – I/O must interact with the cache, so it needs virtual addresses
• Solution to aliases
  – HW guarantee: as long as the bits that cover the index field are the same and the cache is direct mapped, aliased blocks must be unique; called page coloring
• Solution to cache flush
  – Add a process identifier tag that identifies the process as well as the address within the process: can't get a hit if the process is wrong

3: Fast Hits by Pipelining the Cache — Case Study: MIPS R4000

• 8 Stage Pipeline:
  – IF–first half of fetching of instruction; PC selection happens here as well as initiation of instruction cache access.
  – IS–second half of access to instruction cache.
  – RF–instruction decode and register fetch, hazard checking, and instruction cache hit detection.
  – EX–execution, which includes effective address calculation, ALU operation, and branch target computation and condition evaluation.
  – DF–data fetch, first half of access to data cache.
  – DS–second half of access to data cache.
  – TC–tag check, determine whether the data cache access hit.
  – WB–write back for loads and register-register operations.
• What is the impact on load delay?
  – Need 2 instructions between a load and its use!
Case Study: MIPS R4000

[Pipeline diagram: successive instructions entering the 8-stage pipeline (IF IS RF EX DF DS TC WB), each starting one cycle later. A load's data is available only after DS/TC, giving a TWO cycle load latency. Branch conditions are evaluated during the EX phase, giving a THREE cycle branch latency: a delay slot plus two stalls; branch-likely cancels the delay slot if not taken.]

Cache Optimization Summary

  Technique                          MR   MP   HT   Complexity
  -------------------------------------------------------------
  Larger Block Size                  +    –         0
  Higher Associativity               +         –    1
  Victim Caches                      +              2
  Pseudo-Associative Caches          +              2
  HW Prefetching of Instr/Data       +              2
  Compiler Controlled Prefetching    +              3
  Compiler Reduce Misses             +              0
  Priority to Read Misses                 +         1
  Early Restart & Critical Word 1st       +         2
  Non-Blocking Caches                     +         3
  Second Level Caches                     +         2
  Better memory system                    +         3
  Small & Simple Caches              –         +    0
  Avoiding Address Translation                 +    2
  Pipelining Caches                            +    2

  (MR = miss rate, MP = miss penalty, HT = hit time; + helps, – hurts)

Exercise 4

• Solve the following problems in Chapter 5 of Computer Architecture: A Quantitative Approach: 1, 5, 7, 8, 9
• Email the solutions by 85.3.17 to: [email protected]
• Write in the title of the Email: HW4-ACA
