Question: Who Cares About The Memory Hierarchy?: Caches and Memory Systems I
[Figure: CPU vs. DRAM performance, 1980–2000, log scale (1 to 1000). µProc performance grows 60%/yr. (“Moore’s Law”); DRAM performance grows 7%/yr. (“Less’ Law?”). The processor-memory (CPU-DRAM) performance gap grows 50% / year.]
• 1980: no cache in µproc; 1995: 2-level cache on chip
  (1989: first Intel µproc with a cache on chip)
  – TLB: a cache on the page table
  – Branch prediction: a cache on prediction information?

[Figure: cache organization — each entry holds a Valid Bit, Cache Tag, and Cache Data block; e.g. tag 0x50 at index 1 selects a 32-byte block (Byte 32 … Byte 63); 32 entries (index 0–31) hold Byte 0 … Byte 1023.]

[Figure: memory hierarchy — Proc/Regs, L1-Cache, L2-Cache, Memory, Disk/Tape, etc.; lower levels are Bigger, upper levels are Faster.]
Set Associative Cache
• N-way set associative: N entries for each Cache Index
  – N direct mapped caches operating in parallel
• Example: Two-way set associative cache
  – Cache Index selects a “set” from the cache
  – The two tags in the set are compared to the input in parallel
  – Data is selected based on the tag result

[Figure: two-way set associative cache — two ways of Valid / Cache Tag / Cache Data arrays share one Cache Index; the two tag-compare results are ORed into Hit, and a MUX selects the Cache Block.]

Disadvantage of Set Associative Cache
• N-way Set Associative Cache versus Direct Mapped Cache:
  – N comparators vs. 1
  – Extra MUX delay for the data
  – Data comes AFTER Hit/Miss decision and set selection
• In a direct mapped cache, the Cache Block is available BEFORE Hit/Miss:
  – Possible to assume a hit and continue; recover later if miss.
3. Reduce the time to hit in the cache.

– Capacity—If the cache cannot contain all the blocks needed during execution of a program, capacity misses will occur due to blocks being discarded and later retrieved.
  (Misses in a Fully Associative, Size X Cache)
– Conflict—If the block-placement strategy is set associative or direct mapped, conflict misses (in addition to compulsory & capacity misses) will occur because a block can be discarded and later retrieved if too many blocks map to its set. Also called collision misses or interference misses.
  (Misses in an N-way Associative, Size X Cache)
2:1 Cache Rule
• miss rate of a 1-way associative cache of size X
  = miss rate of a 2-way associative cache of size X/2

3Cs Absolute Miss Rate (SPEC92)
[Figure: miss rate (0–0.14) vs. Cache Size (1–128 KB) for 1-way, 2-way, 4-way, and 8-way associativity; each curve decomposes into Conflict, Capacity, and Compulsory components, with the Compulsory component vanishingly small.]
3) Change Compiler: Which of 3Cs is obviously affected?

[Figure: Miss Rate (0%–15%) vs. block size (16–128 bytes) for 4K, 16K, 64K, and 256K caches.]

• Beware: Execution time is the only final measure!
  – Will Clock Cycle time increase?
  – Hill [1988] suggested hit time for 2-way vs. 1-way:
    external cache +10%, internal +2%
Loop Fusion Example

/* Before */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1)
        a[i][j] = 1/b[i][j] * c[i][j];
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1)
        d[i][j] = a[i][j] + c[i][j];

/* After */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1)
    {   a[i][j] = 1/b[i][j] * c[i][j];
        d[i][j] = a[i][j] + c[i][j];}

• Two misses per access to a & c vs. one miss per access; fusing the loops improves temporal locality (a[i][j] and c[i][j] are reused while still in the cache).

Blocking Example

/* Before */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1)
    {   r = 0;
        for (k = 0; k < N; k = k+1)
            r = r + y[i][k]*z[k][j];
        x[i][j] = r;
    };

• Two Inner Loops:
  – Read all N×N elements of z[]
  – Read N elements of 1 row of y[] repeatedly
  – Write N elements of 1 row of x[]
• Capacity Misses are a function of N & Cache Size:
  – 2N³ + N² words accessed => (assuming no conflict; otherwise …)
• Idea: compute on a B×B submatrix that fits in the cache
Blocking Example (continued)

/* After */
for (jj = 0; jj < N; jj = jj+B)
for (kk = 0; kk < N; kk = kk+B)
for (i = 0; i < N; i = i+1)
    for (j = jj; j < min(jj+B,N); j = j+1)
    {   r = 0;
        for (k = kk; k < min(kk+B,N); k = k+1)
            r = r + y[i][k]*z[k][j];
        x[i][j] = x[i][j] + r;
    };

Summary of Compiler Optimizations to Reduce Cache Misses (by hand)
[Figure: bar chart over benchmarks vpenta (nasa7), gmty (nasa7), tomcatv, btrix (nasa7), mxm (nasa7), spice, cholesky (nasa7), compress.]
[Figure: processor vs. DRAM (or lower mem) performance, 1981–2000, log scale (1–100); annotations: “= ƒ(no. operations)”, “1990: Pipelined Execution & Fast Clock Rate”, “Out-of-Order execution”; the widening gap means a growing miss penalty.]

Branch Latency (conditions evaluated during EX phase)
[Figure: pipeline diagram — successive instructions flow through stages IF IS RF EX DF DS TC WB; a branch resolves in EX, giving a THREE cycle penalty: the delay slot plus two stalls. Branch likely cancels the delay slot if not taken.]

Cache Optimization Summary

Technique                            MR   MP   HT   Complexity
miss rate
  Larger Block Size                  +    –         0
  Higher Associativity               +         –    1
  Victim Caches                      +              2
  Pseudo-Associative Caches          +              2
  HW Prefetching of Instr/Data       +              2
  Compiler Controlled Prefetching    +              3
  Compiler Reduce Misses             +              0
miss penalty
  Priority to Read Misses                 +         1
  Early Restart & Critical Word 1st       +         2
  Non-Blocking Caches                     +         3
  Second Level Caches                     +         2
  Better memory system                    +         3
hit time
  Small & Simple Caches              –         +    0
  Avoiding Address Translation                 +    2
  Pipelining Caches                            +    2
Exercise 4