
Lecture 5: Cache Optimization

Appendix B and Ch 2

DAP Spr.‘98 ©UCB 1


How to Improve Cache
Performance?

AMAT = Hit Time + Miss Rate × Miss Penalty


1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache.
4. Increase bandwidth
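
A minimal sketch of the AMAT formula above in C (the function name and the use of cycles as the unit are illustrative, not from the slides):

/* Average Memory Access Time: hit time plus miss rate times miss
   penalty, all measured in CPU cycles. */
double amat(double hit_time, double miss_rate, double miss_penalty)
{
    return hit_time + miss_rate * miss_penalty;
}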

DAP Spr.‘98 ©UCB 2


Introduction
Memory Hierarchy Basics
• Basic cache optimizations:
– Larger block size
» Reduces compulsory misses
» Increases capacity and conflict misses, increases miss
penalty
– Larger total cache capacity to reduce miss rate
» Increases hit time, increases power consumption
– Higher associativity
» Reduces conflict misses
» Increases hit time, increases power consumption

DAP Spr.‘98 ©UCB 3


Larger Block Size

[Figure: miss rate vs. block size (16 to 256 bytes) for cache sizes of
1K, 4K, 16K, 64K, and 256K. Larger blocks reduce compulsory misses, but
past a point the miss rate rises again as conflict misses increase,
especially for the smaller caches.]

DAP Spr.‘98 ©UCB 4


Pseudo-Set Associative Cache

• A pseudo-associative cache sits between a direct-mapped and a set-
associative cache. In a set-associative cache, all entries in the set
are accessed in parallel, which slows down the access. In a pseudo-
associative cache, we view each "way" of the set as a separate
direct-mapped cache. The ways are accessed in sequence, not in
parallel. This saves time if the item is found in the first "way", but
wastes time if it is found in the last "way."
• On an access, you first try the first "way", then the second "way",
etc., until you reach the nth "way".
• On a hit in the kth way, the line is promoted to the first way, and
all lines in ways 1 to k-1 are demoted one way.
• On a miss, the incoming item is placed in the first way, the item in
the nth way is evicted, and all items in ways 1 to n-1 are demoted
one way.
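
A minimal sketch, in C, of the sequential way search with promotion described above (the data structures and names are hypothetical, not from the slides):

/* One set of an n-way pseudo-associative cache: ways are probed in
   order, and a hit (or a fill on a miss) moves the line to way 0. */
#define NWAYS 4

struct line { unsigned tag; int valid; };
struct set  { struct line way[NWAYS]; };

/* Returns the way index that hit, or -1 on a miss. */
int pseudo_assoc_lookup(struct set *s, unsigned tag)
{
    for (int k = 0; k < NWAYS; k++) {
        if (s->way[k].valid && s->way[k].tag == tag) {
            /* Promote the hit line to way 0; demote ways 0..k-1 by one. */
            struct line hit = s->way[k];
            for (int j = k; j > 0; j--)
                s->way[j] = s->way[j - 1];
            s->way[0] = hit;
            return k;                 /* later ways cost extra probe time */
        }
    }
    /* Miss: evict the line in the last way, demote the rest, fill way 0. */
    for (int j = NWAYS - 1; j > 0; j--)
        s->way[j] = s->way[j - 1];
    s->way[0].tag = tag;
    s->way[0].valid = 1;
    return -1;
}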

DAP Spr.‘98 ©UCB 5


DAP Spr.‘98 ©UCB 6
Fast Hit Time + Low Conflict => Victim Cache
• How to combine the fast hit time of direct mapped, yet still avoid
conflict misses?
• Add a small buffer to hold data discarded from the cache
• Check both the cache and the victim buffer simultaneously on a
data request from the CPU
• Jouppi [1990]: a 4-entry victim cache removed 20% to 95% of
conflicts for a 4 KB direct-mapped data cache
• Used in Alpha, HP machines
• Opteron L3 cache is a victim cache

[Figure: a small fully associative victim buffer, each entry holding a
tag, a comparator, and one cache line of data, sitting between the
cache and the next lower level in the hierarchy]
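
A minimal sketch, in C, of a direct-mapped cache backed by a small victim buffer, following the check-both-on-access idea above (sizes, names, and the swap policy shown are illustrative):

/* Direct-mapped cache plus a small fully associative victim buffer.
   addr is a block address. */
#define SETS    64
#define VICTIMS  4

struct cline { unsigned tag;  int valid; };   /* main cache entry               */
struct vline { unsigned addr; int valid; };   /* victim entry: full block address */

struct cline cache[SETS];
struct vline victim[VICTIMS];

/* Returns 1 on a hit in the cache or the victim buffer, 0 on a true miss. */
int lookup(unsigned addr)
{
    unsigned idx = addr % SETS, tag = addr / SETS;

    if (cache[idx].valid && cache[idx].tag == tag)
        return 1;                                 /* hit in the main cache */

    for (int v = 0; v < VICTIMS; v++)
        if (victim[v].valid && victim[v].addr == addr) {
            /* Swap: the victim line moves back into the cache, and the
               displaced cache line takes its slot in the victim buffer. */
            if (cache[idx].valid)
                victim[v].addr = cache[idx].tag * SETS + idx;
            else
                victim[v].valid = 0;
            cache[idx].tag = tag;
            cache[idx].valid = 1;
            return 1;                             /* hit in the victim buffer */
        }
    return 0;   /* true miss: fetch from the next level; the line displaced
                   from the cache would then be placed in the victim buffer */
}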
DAP Spr.‘98 ©UCB 7
Reducing Misses by Hardware Prefetching
of Instructions & Data
• E.g., Instruction Prefetching
– Sequential prefetch or block prefetching
– Most processors fetch 2 blocks of instructions on a miss
– Cache Pollution if fetched block is unused!
– Extra block placed in "stream buffer"
– On miss check stream buffer
• Works with data blocks too:
– Jouppi [1990] 1 data stream buffer satisfied 25% misses from 4KB
cache; 4 streams got 43%
– Palacharla & Kessler [1994]: for scientific programs, 8 streams
satisfied 50% to 70% of misses from two 64 KB, 4-way set-associative
caches
– Data Prediction is difficult, but works well with scientific applications
• Prefetching relies on having extra memory bandwidth that
can be used without penalty
• Question: What to prefetch and when to prefetch?
Instruction prefetch is fine, but data?
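
A minimal sketch, in C, of the sequential stream-buffer idea: on a cache miss, the head of the stream buffer is checked before going to memory, and a hit there triggers a prefetch of the next sequential block (depth, names, and the restart policy are illustrative):

#define SB_DEPTH 4

/* FIFO of prefetched block addresses (a real buffer also holds the data). */
unsigned stream_buf[SB_DEPTH];
int      sb_valid[SB_DEPTH];

void prefetch_block(unsigned blk) { (void)blk; /* model: issue a memory read */ }

/* Called on a cache miss for block address blk.
   Returns 1 if the stream buffer supplied the block. */
int stream_buffer_check(unsigned blk)
{
    if (sb_valid[0] && stream_buf[0] == blk) {
        /* Consume the head, shift the FIFO up, prefetch the next block. */
        for (int i = 0; i < SB_DEPTH - 1; i++) {
            stream_buf[i] = stream_buf[i + 1];
            sb_valid[i]   = sb_valid[i + 1];
        }
        stream_buf[SB_DEPTH - 1] = blk + SB_DEPTH;
        sb_valid[SB_DEPTH - 1]   = 1;
        prefetch_block(blk + SB_DEPTH);
        return 1;
    }
    /* Missed the stream buffer too: flush it and restart at the next block. */
    for (int i = 0; i < SB_DEPTH; i++) {
        stream_buf[i] = blk + 1 + i;
        sb_valid[i]   = 1;
        prefetch_block(blk + 1 + i);
    }
    return 0;
}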
DAP Spr.‘98 ©UCB 8
Advanced Optimizations
Hardware Prefetching
• Fetch two blocks on miss (next sequential block)

Pentium 4 Pre-fetching
Intel Core i7 supports hardware prefetching to both L1 and L2 caches

DAP Spr.‘98 ©UCB 9


Leave it to the Programmer?
Software Prefetching Data
• Data Prefetch – Explicit prefetch instructions
– Load data into register (HP PA-RISC loads)
– Cache Prefetch: load into cache (MIPS, PowerPC, SPARC)
• Prefetching comes in two flavors:
– Binding prefetch: Requests load directly into register.
» Must be correct address and register!
– Non-Binding prefetch: Load into cache.
» Very suitable for prefetching from main memory
• Issuing prefetch instructions takes time
– Is the cost of issuing prefetches < the savings from reduced misses?
– Wider superscalar issue makes it easier to find the issue bandwidth
for the extra prefetch instructions
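
A minimal sketch of non-binding software prefetching in C, using GCC/Clang's __builtin_prefetch (the loop, array, and prefetch distance are illustrative):

#include <stddef.h>

/* Sum an array while prefetching a fixed distance ahead so the data
   is already in the cache when the loop reaches it. */
double sum(const double *a, size_t n)
{
    const size_t dist = 16;                 /* prefetch distance in elements */
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + dist < n)
            __builtin_prefetch(&a[i + dist], 0, 1);  /* read, low temporal locality */
        s += a[i];
    }
    return s;
}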

DAP Spr.‘98 ©UCB 10


Compiler Optimization to Reduce
Miss Rate
• Nested loops may access data in memory non-sequentially, causing
cache misses.
• Exchanging the nesting of the loops can make the code access the
data in order, reducing cache misses.
• Example: if x is a two-dimensional array of size [5000][100],
allocated row major (i.e. x[i][j] is followed in memory by x[i][j+1]),
then modify the program as below.

/* Before */
for (j = 0; j < 100; j++)
    for (i = 0; i < 5000; i++)
        x[i][j] = 2 * x[i][j];

/* After */
for (i = 0; i < 5000; i++)
    for (j = 0; j < 100; j++)
        x[i][j] = 2 * x[i][j];
DAP Spr.‘98 ©UCB 11
EX. Block Matrix Algorithm
• Operate on submatrices (blocks) instead of entire rows or columns.
• The submatrices can fit into cache.

/* Before */
for (i = 0; i < N; i++)
    for (j = 0; j < N; j++) {
        r = 0;
        for (k = 0; k < N; k++)
            r = r + y[i][k] * z[k][j];
        x[i][j] = r;
    }

/* After */
for (jj = 0; jj < N; jj = jj + B)              /* among blocks   */
    for (kk = 0; kk < N; kk = kk + B)          /* among blocks   */
        for (i = 0; i < N; i++)
            for (j = jj; j < jj + B; j++) {    /* within a block */
                r = 0;
                for (k = kk; k < kk + B; k++)  /* within a block */
                    r = r + y[i][k] * z[k][j];
                x[i][j] = x[i][j] + r;
            }

DAP Spr.‘98 ©UCB 12


Figure 2.8 A snapshot of the three arrays x, y, and z when N = 6 and i = 1. The age of accesses to the array elements is
indicated by shade: white means not yet touched, light means older accesses, and dark means newer accesses.
Compared to Figure 2.9, elements of y and z are read repeatedly to calculate new elements of x. The variables i, j, and k
are shown along the rows or columns used to access the arrays.

DAP Spr.‘98 ©UCB 13


Figure 2.9 The age of accesses to the arrays x, y, and z when B = 3. Note that, in contrast to Figure 2.8, a smaller number
of elements is accessed.

DAP Spr.‘98 ©UCB 14


Summary: Miss Rate Reduction
CPU time = IC × (CPI_Execution + (Memory accesses / Instruction) × Miss rate × Miss penalty) × Clock cycle time
• 3 Cs: Compulsory, Capacity, Conflict
0. Larger cache
1. Reduce Misses via Larger Block Size
2. Reduce Misses via Higher Associativity
3. Reducing Misses via Victim Cache
4. Reducing Misses via Pseudo-Associativity
5. Reducing Misses by HW Prefetching Instr, Data
6. Reducing Misses by SW Prefetching Data
7. Reducing Misses by Compiler Optimizations
• Prefetching comes in two flavors:
– Binding prefetch: Requests load directly into register.
» Must be correct address and register!
– Non-Binding prefetch: Load into cache.
» Can be incorrect. Frees HW/SW to guess!

DAP Spr.‘98 ©UCB 15


Review: Improving Cache
Performance
1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache.
4. Increase bandwidth

DAP Spr.‘98 ©UCB 16


1. Reduce Miss Penalty:
Early Restart and Critical Word First
• Don’t wait for full block to be loaded before restarting
CPU
– Early restart—As soon as the requested word of the block
arrives, send it to the CPU and let the CPU continue execution
– Critical Word First—Request the missed word first from
memory and send it to the CPU as soon as it arrives; let the
CPU continue execution while filling the rest of the words in
the block. Also called wrapped fetch and requested word first
• Generally useful only with large blocks
• Spatial locality => the CPU tends to want the next sequential word
soon, so it is not clear how much early restart helps
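
A minimal sketch, in C, of the wrapped-fetch word ordering used by critical word first: the requested word returns first, then the rest of the block in wrap-around order (block size and names are illustrative):

#include <stdio.h>

#define WORDS_PER_BLOCK 8

/* Print the order in which the words of a block are returned when the
   miss was to word 'critical' (critical word first / wrapped fetch). */
void fill_order(unsigned critical)
{
    for (unsigned i = 0; i < WORDS_PER_BLOCK; i++)
        printf("%u ", (critical + i) % WORDS_PER_BLOCK);
    printf("\n");
}

/* fill_order(5) prints: 5 6 7 0 1 2 3 4 */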

DAP Spr.‘98 ©UCB 17


2. Reducing Miss Penalty:
Read Priority over Write on Miss
• Give priority to reads over writes on a miss by putting
the writes in a write buffer
• Write-through with write buffers => RAW conflicts with
main memory reads on cache misses
– If we simply wait for the write buffer to empty, we might increase
the read miss penalty (by 50% on the old MIPS 1000)
– Check write buffer contents before read;
if no conflicts, let the memory access continue
• Write-back: want the buffer to hold displaced blocks
– Consider when a read miss is replacing a dirty block
– Normally: write the dirty block to memory, and then do the read
– Instead: copy the dirty block to a write buffer, then do the read,
and then do the write
– The CPU stalls less since it restarts as soon as the read is done
DAP Spr.‘98 ©UCB 18
Write Buffers
• A write buffer holds words to be written to the L2 cache/memory,
along with their addresses
– 2 to 4 entries deep
– all read misses are checked against pending writes for
dependencies (associatively)
– allows reads to proceed ahead of writes
– can coalesce writes to the same block address to reduce time
(next slide)

[Figure: CPU <-> L1; writes from L1 pass through the write buffer on
their way to L2]
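
A minimal sketch, in C, of checking a small write buffer on a read miss before going to L2, as described above (entry count and field names are illustrative):

#define WB_ENTRIES 4

struct wb_entry {
    unsigned addr;       /* block address of the pending write */
    unsigned data;
    int      valid;
};

struct wb_entry wb[WB_ENTRIES];

/* On a read miss to block 'addr': if a pending write matches, forward
   its data; otherwise the read may bypass the queued writes to L2. */
int read_miss_check(unsigned addr, unsigned *data_out)
{
    for (int i = 0; i < WB_ENTRIES; i++)
        if (wb[i].valid && wb[i].addr == addr) {
            *data_out = wb[i].data;     /* RAW hazard resolved by forwarding */
            return 1;
        }
    return 0;                           /* no conflict: read proceeds to L2 */
}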

DAP Spr.‘98 ©UCB 19


Merging Write Buffers to
Reduce Miss Penalty
• Write buffer to allow processor to continue
while waiting to write to memory
• If buffer contains modified blocks, the
addresses can be checked to see if address
of new data matches the address of a valid
write buffer entry
• If so, new data are combined with that entry
• Increases the effective block size of writes for a write-through
cache when writes go to sequential words or bytes, since multiword
writes are more efficient to memory
• The Sun T1 (Niagara) processor, among
many others, uses write merging
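
A minimal sketch, in C, of merging a new word into an existing write-buffer entry for the same block address (the entry layout, counts, and names are hypothetical):

#define MWB_ENTRIES     4
#define WORDS_PER_ENTRY 4

struct merge_entry {
    unsigned block_addr;                     /* aligned block address */
    unsigned word[WORDS_PER_ENTRY];
    int      word_valid[WORDS_PER_ENTRY];
    int      valid;
};

struct merge_entry mwb[MWB_ENTRIES];

/* Try to merge a one-word write into an existing entry for the same
   block; return 1 on success, 0 if a new buffer entry is needed. */
int write_merge(unsigned block_addr, unsigned word_idx, unsigned data)
{
    for (int i = 0; i < MWB_ENTRIES; i++)
        if (mwb[i].valid && mwb[i].block_addr == block_addr) {
            mwb[i].word[word_idx]       = data;
            mwb[i].word_valid[word_idx] = 1;   /* merged: no new entry used */
            return 1;
        }
    return 0;
}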

DAP Spr.‘98 ©UCB 20


Write Merge in Write Buffers

DAP Spr.‘98 ©UCB 21


4: Add a second-level cache

• L2 Equations
AMAT = Hit TimeL1 + Miss RateL1 x Miss PenaltyL1
Miss PenaltyL1 = Hit TimeL2 + Miss RateL2 x Miss PenaltyL2
AMAT = Hit TimeL1 +
Miss RateL1 x (Hit TimeL2 + Miss RateL2 x Miss PenaltyL2)
• Definitions:
– Local miss rate— misses in this cache divided by the total number
of memory accesses to this cache (Miss rateL2)
– Global miss rate—misses in this cache divided by the total
number of memory accesses generated by the CPU
Global Miss Rate is what matters
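As a worked illustration (hypothetical numbers): if L1 misses on 4% of
CPU accesses and L2's local miss rate is 50%, then L2's global miss
rate is 0.04 × 0.50 = 2% of all CPU accesses, i.e.
Global miss rateL2 = Miss rateL1 × Local miss rateL2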
DAP Spr.‘98 ©UCB 22
Comparing Local and Global Miss Rates
• 32 KB 1st level cache; increasing 2nd level cache size
• Local miss rate is for L2 – very high for small L2 sizes
• Single cache miss rate is the rate if we had one cache of the size
on the x-axis
• Global miss rate is close to the single-level cache rate provided
L2 >> L1
• The idea is to reduce the miss penalty without increasing the miss
rate
• L1 speed affects the CPU clock cycle, but L2 speed does not; L2
only affects the miss penalty of the first-level cache

[Figure: miss rate vs. cache size (log scale), comparing the L2 local
miss rate, the global miss rate, and the single-cache miss rate]
DAP Spr.‘98 ©UCB 23
AMAT Example
• For every 1000 memory references, assume 40
misses in L1 and 20 misses in L2;
Hit time in L1 is 1, L2 is 10; Miss penalty from L2 to
memory is 100 cycles; there are 1.5 memory
references per instruction. What is AMAT and
average stall cycles per instruction?
AMAT = Hit TimeL1 + Miss RateL1 × Miss PenaltyL1
Miss PenaltyL1 = Hit TimeL2 + Miss RateL2 × Miss PenaltyL2
– AMAT = 1 + 40/1000 × (10 + 20/40 × 100) = 3.4 cycles
– AMAT without L2 = 1 + 40/1000 × 100 = 5 cycles => an
improvement of 1.6 cycles due to L2
• Average memory stalls per instruction = Misses per instructionL1 × Hit
timeL2 + Misses per instructionL2 × Miss penaltyL2
– Average stall cycles per instruction = 1.5 × 40/1000 × 10 + 1.5 ×
20/1000 × 100 = 3.6 cycles
• Note: We have not distinguished reads and writes; L2 is accessed
only on an L1 miss, and there are no separate I- and D-caches.
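
A minimal sketch in C that reproduces the arithmetic above (the numbers are the ones given on this slide):

#include <stdio.h>

int main(void)
{
    double miss_rate_l1   = 40.0 / 1000.0;   /* misses per memory reference */
    double miss_rate_l2   = 20.0 / 40.0;     /* local L2 miss rate          */
    double hit_l1 = 1.0, hit_l2 = 10.0, mem_penalty = 100.0;
    double refs_per_instr = 1.5;

    double amat   = hit_l1 + miss_rate_l1 * (hit_l2 + miss_rate_l2 * mem_penalty);
    double stalls = refs_per_instr * (40.0 / 1000.0) * hit_l2
                  + refs_per_instr * (20.0 / 1000.0) * mem_penalty;

    printf("AMAT = %.1f cycles\n", amat);                     /* 3.4 */
    printf("Stall cycles per instruction = %.1f\n", stalls);  /* 3.6 */
    return 0;
}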
DAP Spr.‘98 ©UCB 24
Reducing Miss Penalty Summary
CPU time = IC × (CPI_Execution + (Memory accesses / Instruction) × Miss rate × Miss penalty) × Clock cycle time
• Four techniques
1. Read priority over write on miss
2. Early Restart and Critical Word First on miss
3. Write Buffer
4. Second Level Cache
• Can be applied recursively to Multilevel Caches
– Danger is that the time to DRAM will grow with multiple
levels of cache memories
– First accesses (compulsory misses) in the L2 cache can make
things worse, since the worst-case miss penalty increases

DAP Spr.‘98 ©UCB 25
