
Ten Advanced Optimizations of Cache Performance
Technique                                          | Hit time | Miss penalty | Miss rate | H/W complexity | Why
Larger block size                                  |          | -            | +         | 0              | Reduces miss rate
Larger cache size                                  | -        |              | +         | 1              | Reduces miss rate
Higher associativity                               | -        |              | +         | 1              | Reduces miss rate
Multilevel caches                                  |          | +            |           | 2              | Reduces miss penalty
Read priority over writes                          |          | +            |           | 1              | Reduces miss penalty
Avoiding address translation during cache indexing | +        |              |           | 1              | Reduces hit time

+ improves the factor, - hurts it, blank – no impact; hardware complexity is rated 0-3, from easiest to most challenging.


Ten Advanced Optimizations of Cache Performance
1. Reducing the hit time -
a. Small and simple first-level caches
b. Way-prediction.
** Both techniques also generally decrease power consumption.
2. Increasing cache bandwidth -
a. Pipelined caches,
b. multibanked caches, and
c. nonblocking caches.
**These techniques have varying impacts on power consumption.
3. Reducing the miss penalty -
a. Critical word first and
b. merging write buffers.
** These optimizations have little impact on power.
4. Reducing the miss rate -
a. Compiler optimizations.
** Obviously any improvement at compile time improves power consumption.
5. Reducing the miss penalty or miss rate via parallelism -
a. Hardware prefetching and
b. compiler prefetching.
**These optimizations generally increase power consumption
Hardware complexity generally increases as we go through these optimizations, and several of them also require sophisticated compiler technology.
1. Small and Simple First-Level Caches to
Reduce Hit Time and Power
● A fast clock cycle and power limitations favor a small first-level cache.
● Lower levels of associativity can reduce both hit time and power, but there are tradeoffs.

The critical timing path in a cache hit is the three-step process of
1 addressing the tag memory (indexing),
2 comparing the tags (tag comparison), and
3 selecting the correct way (mux control selection).

Direct-mapped caches can overlap the tag check with the transmission of the data, which reduces hit time.

● Hit time for direct mapped is slightly faster than two-way set associative; two-way set associative is 1.2 times faster than four-way; and four-way is 1.4 times faster than eight-way.
● These estimates depend on technology as well as the size of the cache.

Lower levels of associativity reduce power because fewer cache lines are accessed.
Energy consumption per read increases as cache size and associativity are increased.
● Three other factors have led to the use of higher associativity in first-level caches in recent designs:

1 Many processors take at least two clock cycles to access the cache, so the impact of a longer hit time may not be critical.

2 To keep the TLB out of the critical path, almost all L1 caches should be virtually indexed.
● This limits the size of the cache to the page size times the associativity.

3 With the introduction of multithreading, conflict misses can increase, making higher associativity more attractive.
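● Worked example (illustrative numbers, assuming 4 KiB pages): a virtually indexed L1 limited to page size × associativity can be at most 4 KiB × 8 = 32 KiB when 8-way set associative, but only 4 KiB × 2 = 8 KiB when 2-way.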
2. Way Prediction to Reduce Hit Time
● Predict the way in a set to reduce hit time
● Retains the conflict-miss reduction of set associativity while the hit speed approaches that of a direct-mapped cache.

● Way prediction: extra bits are kept in the cache to predict the way (the block within the set) of the next cache access.
● Only a single tag comparison is performed in that clock cycle, in parallel with reading the cache data.
● A miss results in checking the other blocks for matches in the next clock cycle.

● Block predictor bits are added to each block of a cache.
– The bits select which of the blocks to try on the next cache access.
● If the predictor is correct, the cache access latency is the fast
hit time.
● If not, it tries the other block, changes the way predictor, and has
a latency of one extra clock cycle.
● Set prediction accuracy is 90% for a two-way set associative cache and 80% for a four-way set associative cache.
● Prediction accuracy is better on I-caches than on D-caches.
● Way selection: use the way prediction bits to decide which cache block to actually access.
– Saves power when the way prediction is correct,
– but adds significant time on a way misprediction,
– so it is likely to make sense only in low-power processors.
– A significant drawback of way selection is that it makes it difficult to pipeline the cache access.
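A minimal software model of this lookup flow, written as a sketch in C; the structure and field names are hypothetical, since real way prediction is implemented in hardware:

#include <stdbool.h>
#include <stdint.h>

#define WAYS 2                      /* two-way set associative (illustrative) */

/* Hypothetical model of one cache set plus its way-predictor bits. */
struct cache_set {
    uint64_t tag[WAYS];
    bool     valid[WAYS];
    int      predicted_way;         /* extra predictor bits kept in the cache */
};

/* Returns the access latency in cycles: 1 for a correctly predicted hit,
   2 for a hit in another way (one extra cycle, predictor retrained),
   or miss_penalty for a genuine miss. */
int lookup(struct cache_set *s, uint64_t tag, int miss_penalty)
{
    int w = s->predicted_way;

    /* Cycle 1: compare only the predicted way, in parallel with the data read. */
    if (s->valid[w] && s->tag[w] == tag)
        return 1;                   /* fast hit */

    /* Cycle 2: check the remaining ways and update the way predictor. */
    for (int i = 0; i < WAYS; i++) {
        if (i != w && s->valid[i] && s->tag[i] == tag) {
            s->predicted_way = i;   /* retrain the predictor */
            return 2;               /* slow hit: one extra clock cycle */
        }
    }
    return miss_penalty;            /* miss in all ways */
}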
3. Pipelined Cache Access to
Increase Cache Bandwidth
● pipelined cache for faster clock cycle time
● Split cache memory access into several sub stages
● Indexing, Tag read, Hit/Miss check, Data Transfer

● pipeline cache access for high bandwidth


● Intel Pentium processors in the mid-1990s took 1 clock cycle,
● the Pentium Pro through Pentium III (mid-1990s through 2000) took 2 clocks, and
● the Pentium 4, which became available in 2000, and the current Intel Core i7 take 4 clocks.

● The result is a fast clock cycle and high bandwidth but slow hits,
– leading to a greater branch misprediction penalty.
● Makes it easier to implement high degrees of associativity.
4.Nonblocking Caches (lockup-free cache)
to Increase Cache Bandwidth
● In computers that allow out-of-order execution, the processor need not stall on a data cache miss.
● E.g., it can continue fetching instructions from the instruction cache while waiting for the data cache to return data.

● cache may further lower the effective miss penalty if it can overlap multiple
misses: a “hit under multiple miss” or “miss under miss” optimization.
● “hit under miss” optimization reduces the effective miss penalty
by being helpful during a miss instead of ignoring the requests of
the processor.
● “miss under miss” is beneficial only if the memory system can
service multiple misses;
● High-performance processors (e.g., the Intel Core i7) usually support both;
● lower-end processors (e.g., the ARM A8) provide only limited nonblocking support in L2.
● MSHRs (Miss Status Handling Registers) track the outstanding misses.
● It is difficult to judge the impact of any single miss and hence to calculate the average memory access time.
● The effective miss penalty is not the sum of the misses but the nonoverlapped time that the processor is stalled.

● The benefit of nonblocking caches is complex, as it depends upon:
– the miss penalty when there are multiple misses,
– the memory reference pattern, and
– how many instructions the processor can execute with a miss outstanding.
● Out-of-order processors are capable of hiding much of the miss penalty of an L1 data cache miss
that hits in the L2 cache but are not capable of hiding a significant fraction of a lower level cache
miss.

● Deciding how many outstanding misses to support depends on a variety of factors:


● The temporal and spatial locality in the miss stream, which determines whether a miss
can initiate a new access to a lower level cache or to memory
● The bandwidth of the responding memory or cache
● To allow more outstanding misses at the lowest level of the cache requires supporting at
least that many misses at a higher level, since the miss must initiate at the highest level
cache
● The latency of the memory system

● In their study, Li, Chen, Brockman, and Jouppi found that the reduction in CPI
● for the integer programs was about 7% for one hit under miss and about 12.7% for 64, and
● for the floating-point programs was 12.7% for one hit under miss and 17.8% for 64.
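A rough sketch of the bookkeeping behind a nonblocking cache, in C; the sizes and field names are assumptions for illustration, not the organization of any particular processor:

#include <stdbool.h>
#include <stdint.h>

#define NUM_MSHRS   8               /* max outstanding misses (illustrative) */
#define MAX_TARGETS 4               /* requests merged into one in-flight fill */

/* One miss-status handling register: an in-flight block fill plus the
   processor requests waiting on it. */
struct mshr {
    bool     valid;
    uint64_t block_addr;
    int      num_targets;
    struct { uint8_t offset; uint8_t dest_reg; } targets[MAX_TARGETS];
};

enum outcome { MERGED, ALLOCATED, STALL };

/* On a miss: merge with an entry already fetching the same block, allocate a
   new entry (miss under miss), or stall the cache if all MSHRs are busy. */
enum outcome handle_miss(struct mshr m[], uint64_t block_addr,
                         uint8_t offset, uint8_t dest_reg)
{
    for (int i = 0; i < NUM_MSHRS; i++) {
        if (m[i].valid && m[i].block_addr == block_addr &&
            m[i].num_targets < MAX_TARGETS) {
            m[i].targets[m[i].num_targets].offset   = offset;
            m[i].targets[m[i].num_targets].dest_reg = dest_reg;
            m[i].num_targets++;
            return MERGED;          /* secondary miss to a block already in flight */
        }
    }
    for (int i = 0; i < NUM_MSHRS; i++) {
        if (!m[i].valid) {
            m[i].valid       = true;
            m[i].block_addr  = block_addr;
            m[i].num_targets = 1;
            m[i].targets[0].offset   = offset;
            m[i].targets[0].dest_reg = dest_reg;
            return ALLOCATED;       /* new outstanding miss */
        }
    }
    return STALL;                   /* all MSHRs busy: the cache must block */
}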
5.Multibanked Caches to Increase Cache
Bandwidth
● Rather than treating the cache as a single monolithic block, divide it into independent banks that can support simultaneous accesses.
● The Arm Cortex-A8 supports 1-4 banks in its L2 cache;
● the Intel Core i7 has 4 banks in L1 and the L2 has 8 banks.

● Banking works best when the accesses naturally spread themselves across the banks, so the mapping of addresses to banks affects the behavior of the memory system.
● A simple mapping : sequential interleaving.
● For example, if there are four banks,
● bank 0 has all blocks whose address modulo 4 is 0,
● bank 1 has all blocks whose address modulo 4 is 1, and so on.
Figure 2.6: Four-way interleaved cache banks using block addressing. Assuming 64 bytes per block, each of these addresses would be multiplied by 64 to get the byte address.
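A one-function sketch of the sequential interleaving just described, following the figure's assumption of 64-byte blocks and four banks:

#include <stdint.h>

#define NUM_BANKS  4                /* four-way interleaving, as in Figure 2.6 */
#define BLOCK_SIZE 64               /* bytes per block */

/* Sequential interleaving: consecutive block addresses map to consecutive
   banks, so bank 0 holds blocks 0, 4, 8, ..., bank 1 holds 1, 5, 9, ..., etc. */
static inline unsigned bank_of(uint64_t byte_addr)
{
    uint64_t block_addr = byte_addr / BLOCK_SIZE;
    return (unsigned)(block_addr % NUM_BANKS);
}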

● Multiple banks also are a way to reduce power consumption both in caches and DRAM.
6.Critical Word First and Early
Restart to Reduce Miss Penalty
● Processor normally needs just one word of the block at a time.

● Don’t wait for the entire block to be loaded for restarting the processor.

● Early restart -
● Fetch the words in normal order, but
● as soon as the requested word of the block arrives send it to the processor
● and let the processor continue execution.
● The L2 controller is not involved in this technique.
● Critical word first -
● Request the missed word first from memory and
● send it to the processor as soon as it arrives;
● let the processor continue execution while filling the rest of the words in the block.
● The L2 cache controller forwards the words of a block out of order;
● the L1 cache controller should rearrange the words within the block.
● These techniques in general benefit designs with large cache
blocks.

● Spatial locality: there is a good chance that the next reference is to the rest of the block.

● Miss penalty is not simple to calculate.


– When there is a second request in critical word first, the effective miss penalty is the
nonoverlapped time from the reference until the second piece arrives.

● The benefits of critical word first and early restart depend on the
size of the block and the likelihood of another access to the
portion of the block that has not yet been fetched.
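A tiny sketch of the wrap-around return order used by critical word first; the 8-word block is an illustrative size:

#include <stdio.h>

#define WORDS_PER_BLOCK 8           /* e.g., a 64-byte block of 8-byte words */

/* Prints the order in which memory returns the words of a block when the
   miss was to word 'critical': the requested word first, then wrap around. */
void print_return_order(int critical)
{
    for (int i = 0; i < WORDS_PER_BLOCK; i++)
        printf("%d ", (critical + i) % WORDS_PER_BLOCK);
    printf("\n");                   /* critical = 5 prints: 5 6 7 0 1 2 3 4 */
}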
7. Merging Write Buffer to Reduce Miss
Penalty

Write-through caches use write buffers to send data to the lower level of the hierarchy.

Write-back caches use a simple buffer when a block is replaced.

If the write buffer is empty:
the data and the full address are written in the buffer,
the write is finished from the processor's perspective,
and the processor continues working while the write buffer writes to memory.

Write merging: when performing a write to a block that is already pending in the write buffer, update that write buffer entry.



Ex: the Intel Core i7 uses write merging.


Reduces stalls due to full write buffer.


Multiword writes are usually faster than writes performed one word at a time.
Figure 2.7 shows a write buffer without and with write merging. Assume four entries in the write buffer, with each entry holding four 64-bit words.
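A sketch of the merging check in C, sized to match the four-entry, four-word buffer assumed for Figure 2.7; the field names are hypothetical:

#include <stdbool.h>
#include <stdint.h>

#define ENTRIES         4           /* write-buffer entries */
#define WORDS_PER_ENTRY 4           /* four 64-bit words per entry */
#define ENTRY_BYTES     (WORDS_PER_ENTRY * 8)

struct wb_entry {
    bool     valid;
    uint64_t block_addr;            /* address aligned to ENTRY_BYTES */
    uint64_t data[WORDS_PER_ENTRY];
    uint8_t  word_valid;            /* bitmap of words holding pending writes */
};

/* Returns true if the write was absorbed (merged into a matching entry or
   placed in a free one); false means the buffer is full and the processor
   must stall until an entry drains to memory. */
bool write_buffer_put(struct wb_entry buf[ENTRIES], uint64_t addr, uint64_t value)
{
    uint64_t block = addr & ~(uint64_t)(ENTRY_BYTES - 1);
    unsigned word  = (unsigned)((addr % ENTRY_BYTES) / 8);

    /* Write merging: the new write falls in a block already pending here. */
    for (int i = 0; i < ENTRIES; i++) {
        if (buf[i].valid && buf[i].block_addr == block) {
            buf[i].data[word]  = value;
            buf[i].word_valid |= (uint8_t)(1u << word);
            return true;
        }
    }
    /* No match: take a free entry if one exists. */
    for (int i = 0; i < ENTRIES; i++) {
        if (!buf[i].valid) {
            buf[i].valid      = true;
            buf[i].block_addr = block;
            buf[i].data[word] = value;
            buf[i].word_valid = (uint8_t)(1u << word);
            return true;
        }
    }
    return false;                   /* full buffer: stall */
}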
8. Compiler Optimizations to Reduce
Miss Rate
● Loop Interchange
● Blocking
Loop Interchange

● Nested loops may access data in nonsequential order.
● Swap the nested loops to access the data in sequential order.
● Ex: x is a two-dimensional array of size [5000,100]

● Reduces misses by improving spatial locality;
● reordering maximizes use of the data in a cache block before it is discarded.
● Original code: skips through memory in strides of 100 words.
● Revised version: accesses all the words in one cache block before going to the next block (see the sketch below).

● Improves cache performance without affecting the number of instructions executed.
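A sketch of the loop-interchange example above, assuming C's row-major storage for the 5000 × 100 array; the doubling in the loop body is just an illustrative operation:

#define ROWS 5000
#define COLS 100

double x[ROWS][COLS];

/* Before: the inner loop walks down a column, striding 100 words through
   memory on every iteration, so each access touches a different block. */
void scale_before(void)
{
    for (int j = 0; j < COLS; j++)
        for (int i = 0; i < ROWS; i++)
            x[i][j] = 2 * x[i][j];
}

/* After interchange: the inner loop walks along a row, using every word of a
   cache block before moving on. Same instruction count, better spatial locality. */
void scale_after(void)
{
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            x[i][j] = 2 * x[i][j];
}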
Blocking

● Instead of operating on entire rows or columns, subdivide the matrices into blocks.

● Requires more accesses but improves locality of accesses

● The goal is to maximize accesses to the data loaded into the cache before the data are replaced.

● This optimization improves temporal locality to reduce misses.


(Figure annotations: white – not yet touched; lighter shade – older accesses; dark – newer accesses.)

Elements of y and z are read repeatedly to calculate new elements of x.
If the cache can hold one N×N matrix and one row of N, then at least the ith row of y and the array z may stay in the cache.
Worst case: 2N³ + N² memory accesses for N³ operations.
With blocking factor B: 2N³/B + N² accesses, an improvement by a factor of about B.
Blocking exploits locality: y benefits from spatial locality, z benefits from temporal locality.
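A sketch of the blocked matrix multiply x = y × z summarized above; N and the blocking factor B are illustrative values, and B should be chosen so the working tiles fit in the cache:

#define N 512                       /* illustrative matrix dimension */
#define B 32                        /* blocking factor: tune so tiles fit in cache */

/* Static arrays are zero-initialized, which the blocked version relies on
   because it accumulates into x. */
static double x[N][N], y[N][N], z[N][N];

static int min(int a, int b) { return a < b ? a : b; }

/* Unblocked: streams through all of z for every row of x, giving on the
   order of 2N^3 + N^2 memory words accessed for N^3 operations. */
void matmul(void)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double r = 0;
            for (int k = 0; k < N; k++)
                r += y[i][k] * z[k][j];
            x[i][j] = r;
        }
}

/* Blocked: operates on B-wide strips so the touched parts of y and z stay
   resident, cutting accesses to roughly 2N^3/B + N^2. */
void matmul_blocked(void)
{
    for (int jj = 0; jj < N; jj += B)
        for (int kk = 0; kk < N; kk += B)
            for (int i = 0; i < N; i++)
                for (int j = jj; j < min(jj + B, N); j++) {
                    double r = 0;
                    for (int k = kk; k < min(kk + B, N); k++)
                        r += y[i][k] * z[k][j];
                    x[i][j] += r;
                }
}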
9.Hardware Prefetching of Instructions
and Data to Reduce Miss Penalty or Miss
Rate
● Nonblocking caches effectively reduce the miss penalty by overlapping execution with memory
access.

● Another approach is to prefetch items before the processor requests them.

● Both instructions and data can be prefetched, either directly into the caches or into an external buffer
that can be more quickly accessed than main memory.

● Instruction prefetch is frequently done in hardware outside of the cache.

● Typically, the processor fetches two blocks on a miss: the requested block and the next
consecutive block.

● The requested block is placed in the instruction cache when it returns, and the prefetched block is
placed into the instruction stream buffer.

● If the requested block is present in the instruction stream buffer, the original cache request is
canceled, the block is read from the stream buffer, and the next prefetch request is issued.

● A similar approach can be applied to data accesses
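A software model of the next-block instruction-prefetch policy described above, with a single-entry stream buffer; the structure is a sketch for illustration, since the real mechanism lives in hardware:

#include <stdbool.h>
#include <stdint.h>

/* One-entry stream buffer holding the prefetched next-consecutive block. */
struct stream_buffer {
    bool     valid;
    uint64_t block_addr;
};

/* Models the policy on an instruction-cache miss for 'block'. Returns true
   if the block was found in the stream buffer (the original cache request is
   canceled and the block is taken from the buffer), false if it must be
   fetched from the next level. Either way, the next consecutive block is
   prefetched into the stream buffer. */
bool icache_miss(struct stream_buffer *sb, uint64_t block)
{
    bool hit_in_buffer = sb->valid && sb->block_addr == block;

    /* Prefetch the next consecutive block into the stream buffer. */
    sb->valid      = true;
    sb->block_addr = block + 1;

    return hit_in_buffer;
}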


10. Compiler-Controlled Prefetching to
Reduce Miss Penalty or Miss Rate
● An alternative to hardware prefetching is for the compiler to
insert prefetch instructions to request data before the
processor needs it.

● There are two flavors of prefetch:
● Register prefetch loads the data into a register.
● Cache prefetch loads the data into the cache.

● Use loop unrolling and scheduling to prefetch data for adjacent iterations.
● A normal load instruction could be considered a “faulting register prefetch
instruction.”
● Nonfaulting prefetches simply turn into no-ops if they would normally result
in an exception, which is what we want.

● Prefetching makes sense only if the processor can proceed while prefetching the data; that is, the caches do not stall but continue to supply instructions and data while waiting for the prefetched data to return.

● The goal is to overlap execution with the prefetching of data.

● If the miss penalty is small, the compiler just unrolls the loop once or twice,
and it schedules the prefetches with the execution.
● If the miss penalty is large, it uses software pipelining or unrolls many times
to prefetch data for a future iteration.
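A sketch of compiler-style prefetching using the GCC/Clang __builtin_prefetch intrinsic, which maps to a nonfaulting cache-prefetch instruction on targets that have one; the prefetch distance of 16 iterations is an assumption that would be tuned against the actual miss penalty:

#define PREFETCH_DIST 16            /* assumed distance, in iterations */

/* Sums an array while prefetching the data needed PREFETCH_DIST iterations
   ahead, so the prefetches overlap with the work on the current elements. */
double sum(const double *a, long n)
{
    double s = 0.0;
    for (long i = 0; i < n; i++) {
        if (i + PREFETCH_DIST < n)
            __builtin_prefetch(&a[i + PREFETCH_DIST], /* rw = */ 0, /* locality = */ 1);
        s += a[i];
    }
    return s;
}

With a larger miss penalty, the compiler would unroll the loop and prefetch further ahead, as described above.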
Cache Optimization Summary

The techniques to improve hit time, bandwidth, miss penalty, and miss rate generally affect the other components of the average memory access time equation, as well as the complexity of the memory hierarchy.
