
EE282 Lecture 4

Advanced Caching (2)


Jacob Leverich

http://eeclass.stanford.edu/ee282

EE282 – Spring 2011 – Lecture 04


Announcements
 HW1 out
 Due Wed 4/20 @ 5pm, box outside Gates 305

2
Today’s lecture: More Caches
 Advanced cache optimizations
 H&P Chapter 5

 Cache coherence
 H&P Chapter 4

 Software-managed memories

 Beyond processor caches


3
Advanced Cache Optimizations
 Multi-level caches and inclusion
 Victim caches
 Pseudo-associative caches
 Skew-associative caches
 Critical word first
 Non-blocking caches
 Prefetching
 Multi-ported caches

 Readings: H&P 5.1-2 and 4.2


 Read on your own about way prediction, pipelined caches, merging
write buffers, compiler optimizations

4
Non-blocking or Lockup-Free Caches
 Basic idea
 Allow for hits while serving a miss (hit-under-miss)
 Allow for more than one outstanding miss (miss-under-miss)
 When does it make sense (for L1, L2, …)
 When the processor can handle >1 pending load/store
 This is the case with superscalar processors
 When the cache serves >1 processor or other cache
 When the lower level allows for multiple pending accesses
 More on this later
 What is difficult about non-blocking caches
 Handling multiple misses at the same time
 Handling loads to pending misses
 Handling stores to pending misses
5
Potential of Non-blocking Caches

(Figure: execution timelines. With a blocking cache, the CPU stalls for the full miss penalty on every miss. With hit-under-miss, the CPU keeps executing hits during one outstanding miss and stalls only when the result is needed. With multiple outstanding misses, several miss penalties overlap.)
6
Miss Status Handling Register
 Keeps track of
 Outstanding cache misses
 Pending load & stores that refer to that cache block
 Fields of an MSHR
 Valid bit
 Cache block address
 Must support associative search
 Issued bit (1 if request already issued to memory)
 For each pending load or store
 Valid bit
 Type (load/store) and format (byte/halfword/…)
 Block offset
 Destination register for load OR store buffer entry for stores
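
A minimal sketch of these fields as a C struct (names are illustrative; the widths match the diagram on the next slide):

#include <stdint.h>

#define MSHR_TARGETS 4          /* load/store entries per MSHR */

/* One pending load/store waiting on the missing block */
typedef struct {
    uint8_t valid;              /* 1 bit: entry in use */
    uint8_t type;               /* 3 bits: load/store + format (byte/halfword/...) */
    uint8_t block_offset;       /* 5 bits: word within the cache block */
    uint8_t dest;               /* 5 bits: dest register (load) or store buffer entry (store) */
} mshr_target_t;

/* One miss status handling register */
typedef struct {
    uint8_t       valid;        /* 1 bit: MSHR in use */
    uint32_t      block_addr;   /* 27 bits: block address (searched associatively) */
    uint8_t       issued;       /* 1 bit: request already sent to memory */
    mshr_target_t target[MSHR_TARGETS];
} mshr_t;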
7
MSHR

Layout (field widths in bits):

  Valid (1) | Block Address (27) | Issued (1)
  Load/store 0: Valid (1) | Type (3) | Block Offset (5) | Destination (5)
  Load/store 1: Valid (1) | Type (3) | Block Offset (5) | Destination (5)
  Load/store 2: Valid (1) | Type (3) | Block Offset (5) | Destination (5)
  Load/store 3: Valid (1) | Type (3) | Block Offset (5) | Destination (5)
8
Non-blocking Caches: Operation
 On a cache miss:
  Search MSHRs for a pending access to the same cache block
  If found, just allocate a new load/store entry
  If not, allocate a free MSHR
  Update the block address and the first load/store entry
  If no MSHR or load/store entry is free, stall
 When one word/sub-block of the cache line becomes available
  Check which loads/stores are waiting for it
  Forward the data to the LSU
  Mark those loads/stores as invalid
  Write the word into the cache
 When the last word of the cache line is available
  Mark the MSHR as invalid
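
A hedged sketch of the miss path in C, reusing the mshr_t type from the sketch above (the allocation policy and helper names are illustrative, not a definitive implementation):

/* Returns 0 if the miss was absorbed by an MSHR, -1 if the pipeline must stall. */
int handle_miss(mshr_t *mshrs, int nmshr, uint32_t block_addr, mshr_target_t req) {
    mshr_t *m = 0;

    /* Associative search for a pending miss to the same cache block */
    for (int i = 0; i < nmshr; i++)
        if (mshrs[i].valid && mshrs[i].block_addr == block_addr) { m = &mshrs[i]; break; }

    if (!m) {
        /* No match: allocate a free MSHR */
        for (int i = 0; i < nmshr; i++)
            if (!mshrs[i].valid) { m = &mshrs[i]; break; }
        if (!m) return -1;                  /* no free MSHR: stall */
        m->valid = 1;
        m->block_addr = block_addr;
        m->issued = 0;                      /* memory request not yet sent */
    }

    /* Record the load/store in a free target entry */
    for (int i = 0; i < MSHR_TARGETS; i++) {
        if (!m->target[i].valid) {
            m->target[i] = req;
            m->target[i].valid = 1;
            return 0;
        }
    }
    return -1;                              /* no free load/store entry: stall */
}

The fill path (forwarding to the LSU, writing words into the cache, freeing the MSHR) follows the steps above and is omitted here.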
9
Non-blocking Cache Efficacy

Cache optimization | Cold | Capacity | Conflict | Miss penalty | Hit time | Bandwidth
Non-blocking cache |      |          |          |              |          |

10
Prefetching
 Idea: fetch data into the cache before the processor requests it
 Can address cold misses
 Can be done by the programmer, compiler, or hardware

 Characteristics of ideal prefetching


 You only prefetch data that are truly needed
 Avoid bandwidth waste
 You issue prefetch requests early enough
 To hide the memory latency
 You don’t issue prefetch requests too early
 To avoid cache pollution
11
Software Prefetching
for (i=0; i<N; i++) {
  __prefetch(a[i+8]);
  __prefetch(b[i+8]);
  sum += a[i]*b[i];
}

Doesn’t have to be correct! A bogus prefetch is simply ignored:
  __prefetch(-1);

 Issues with software prefetching
  Takes up issue slots
  Not a big issue with superscalar
  Takes up system bandwidth
  Must have non-blocking caches
  Prefetch distance depends on the specific system implementation
  Non-portable code
  Not easy to use for pointer-based structures
  Requires a ninja programmer/compiler!
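
The __prefetch() above is pseudocode; with GCC or Clang the same loop can be written using the real __builtin_prefetch(addr, rw, locality) intrinsic. A sketch (the lookahead distance of 8 is illustrative and machine-dependent):

/* Dot product with software prefetching. Prefetching a little past the end
   of the array is harmless in practice: prefetches never fault. */
double dot(const double *a, const double *b, int n) {
    double sum = 0.0;
    for (int i = 0; i < n; i++) {
        __builtin_prefetch(&a[i + 8], 0, 1);   /* 0 = read, 1 = low temporal locality */
        __builtin_prefetch(&b[i + 8], 0, 1);
        sum += a[i] * b[i];
    }
    return sum;
}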
12
Hardware Prefetching
 Same goal as software prefetching, but initiated by hardware
 Can tune to specific system implementation
 Does not waste instruction issue bandwidth
 More portable code
 Major design questions
 Where to place a prefetch engine?
 L1, L2, …
 What to prefetch?
 Next sequential cache line(s), strided patterns, pointers, …
 When to prefetch?
 On a load, on a miss, when other prefetched data used, …
 Where to place prefetched data?
 In the cache or in a special prefetch buffer
 How to handle VM exceptions?
 Don’t prefetch beyond a page?
13
Simple Sequential Prefetching
 On a cache miss, fetch two sequential memory
blocks
 Exploits spatial locality in both instructions & data
 Exploits high bandwidth for sequential accesses

 Called “Adjacent Cache Line Prefetch” or “Spatial


Prefetch” by Intel

 Extend to fetching N sequential memory blocks
  Pick N large enough to hide the memory latency
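
A minimal sketch of the policy (fetch_line() stands in for a request to the next memory level and is hypothetical):

#include <stdint.h>

extern void fetch_line(uint64_t line);     /* hypothetical: request line from next level */

/* On a demand miss to 'line', also request the next n-1 sequential lines
   (n = 2 gives the adjacent-line prefetch described above). */
void on_demand_miss(uint64_t line, int n) {
    fetch_line(line);                      /* the demand fetch itself */
    for (int i = 1; i < n; i++)
        fetch_line(line + i);              /* sequential prefetches */
}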
14
Stream Prefetching
 Sequential prefetching problem
 Performance slows down once every N cache lines
 Stream prefetching is a continuous version of prefetching
 Stream buffer can fit N cache lines
 On a miss, start fetching N sequential cache lines
 On a stream buffer hit:
 Move cache line to cache, start fetching line (N+1)
 In other words, stream buffer tries to stay N cache lines ahead
 Design issues
 When is a stream buffer allocated
 When is a stream buffer released
 Can use multiple stream buffers to capture multiple streams
 E.g. a program operating on 2 arrays
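
A hedged sketch of one stream buffer's behavior (helper names are hypothetical):

#include <stdint.h>

extern void fetch_line(uint64_t line);      /* hypothetical: request from next level */
extern void move_to_cache(uint64_t line);   /* hypothetical: promote line into cache */

typedef struct { uint64_t next; } stream_buf_t;   /* next sequential line to prefetch */

/* Miss that allocates a stream: start fetching the next 'ahead' lines. */
void on_alloc_miss(stream_buf_t *sb, uint64_t line, int ahead) {
    sb->next = line + 1;
    for (int i = 0; i < ahead; i++)
        fetch_line(sb->next++);
}

/* Stream buffer hit: move the line into the cache and fetch line N+1,
   so the buffer stays N lines ahead of the access stream. */
void on_stream_hit(stream_buf_t *sb, uint64_t line) {
    move_to_cache(line);
    fetch_line(sb->next++);
}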

15
Stream Buffer Design

16
Strided Prefetching
 Idea: detect and prefetch strided accesses
  for (i=0; i<N; i++) A[i*1024]++;
 Stride detected using a PC-based table
  For each PC, remember the stride

  PC      | Stride | Last Addr | Conf
  0x08ab0 | 8      | 0xff024   | 10
  0x03fa8 | 1024   | 0xf0ab2   | 11

 Stride detection
  Remember the last address used for this PC
  Compare to the currently used address for this PC
  Track confidence using a two-bit saturating counter
  Increment when the stride is correct, decrement when incorrect
 How to use the PC-based table
  Similar to stream prefetching, except using the stride instead of +1
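
A sketch of one table entry's update rule (indexing by PC and the prefetch call are illustrative):

#include <stdint.h>

extern void prefetch_line(uint64_t addr);   /* hypothetical prefetch request */

typedef struct {
    uint64_t last_addr;   /* last address issued by this PC */
    int64_t  stride;      /* last observed stride */
    uint8_t  conf;        /* two-bit saturating confidence counter (0..3) */
} stride_entry_t;

void on_load(stride_entry_t *e, uint64_t addr) {
    int64_t stride = (int64_t)(addr - e->last_addr);
    if (stride == e->stride) {
        if (e->conf < 3) e->conf++;         /* stride confirmed: gain confidence */
    } else {
        if (e->conf > 0) e->conf--;         /* stride broken: lose confidence */
        e->stride = stride;                 /* learn the new stride */
    }
    e->last_addr = addr;
    if (e->conf >= 2)
        prefetch_line(addr + e->stride);    /* prefetch one stride ahead */
}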
18
Sandy Bridge Prefetching
(Intel Core i7-2600K)
 “Intel 64 and IA-32 Architectures Optimization
Reference Manual, Jan 2011”, pg 2-24

http://www.intel.com/Assets/PDF/manual/248966.pdf
19
Other Ideas in Prefetching
 Prefetch for pointer-based data structures
 Predict if fetched data contain a pointer & follow it
 Works for linked lists, graphs, etc.
 Must be very careful:
 What is a pointer?
 How far to prefetch?
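
Hardware pointer prefetchers guess which fetched words are pointers and follow them; the software analogue of the same idea is easy to sketch (work() is a hypothetical per-node computation):

typedef struct node { struct node *next; int payload; } node_t;

extern int work(int payload);               /* hypothetical per-node computation */

int walk(node_t *n) {
    int sum = 0;
    while (n) {
        __builtin_prefetch(n->next);        /* fetch the next node while we work
                                               on this one; prefetching a NULL
                                               pointer is harmless */
        sum += work(n->payload);
        n = n->next;
    }
    return sum;
}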

 Different correlation techniques


 Markov prefetchers
 Delta correlation prefetchers

20
Prefetching Efficacy

Cache optimization | Cold | Capacity | Conflict | Miss penalty | Hit time | Bandwidth
Prefetching        |      |          |          |              |          |

21
Multi-ported Caches
 Idea: allow for multiple accesses in parallel
 Processor with many LSUs, I+D access in L2, …

 Can be implemented in multiple ways


 True multi-porting
 Multiple banks

 What is difficult about multi-porting


 Interaction between parallel accesses (especially for
stores)
22
True Multi-porting
 True multiporting
 Use 2-ported tag/data storage
 Problem: large area increase
 Problem: hit time increase

(Figure: two request ports and two data ports into a single cache array.)

23
Multi-banked Caches
(Figure: request 1 is routed to cache bank 1 and request 2 to cache bank 2; the two reads return data in parallel.)

 Partition address space into multiple banks


 Bank0 caches addresses from partition 0, bank1 from partition 1…
 Can use least or most significant address bits for partitioning
 What are the advantages of each approach?

 Benefits: accesses can go in parallel if no conflicts


 Challenges: conflicts, distribution network, bank
utilization
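
A minimal sketch of bank selection (sizes illustrative). Using low-order line-address bits interleaves consecutive lines across banks, which spreads sequential traffic; using high-order bits instead gives each bank a contiguous region of the address space:

#include <stdint.h>

#define BLOCK_BITS 6                      /* 64-byte cache lines (illustrative) */
#define NBANKS     8                      /* must be a power of two */

/* Low-order interleaving: consecutive lines map to consecutive banks. */
static inline unsigned bank_of(uint64_t addr) {
    return (addr >> BLOCK_BITS) & (NBANKS - 1);
}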
24
Sun UltraSPARC T2
8-bank L2 cache

25
Multi-porting Efficacy

Cache optimization | Cold | Capacity | Conflict | Miss penalty | Hit time | Bandwidth
Multi-porting      |      |          |          |              |          |

26
Summary of Advanced Cache Optimizations

Cache optimization  | Cold | Capacity | Conflict | Miss penalty | Hit time | Bandwidth
Multi-level         |      |          |          | +            |          |
Victim cache        |      |          | ~        | +            |          |
Pseudo-assoc.       |      |          | +/~      |              |          |
Skew-assoc.         |      |          | +        |              | ~        |
Non-blocking        |      |          |          | +            |          | ~
Critical-word-first |      |          |          | +            |          |
Prefetching         | +    |          | -        | +            |          |
Multi-porting       |      |          |          |              | ~        | +

Also see Figure 5.11 in H&P
27
Today’s lecture: More Caches
 Advanced cache optimizations
 H&P Chapter 5

 Cache coherence
 H&P Chapter 4

 Software-managed memories

 Beyond processor caches


28
Cache Coherence Problem
(Figure: three cores P1, P2, P3 with private caches ($) above a shared bus connecting memory and I/O devices. Events: (1) P1 reads u=5 from memory; (2) P3 reads u=5; (3) P3 writes u=7 into its own cache; (4) P1 then reads u and still sees 5 in its cache; (5) P2 reads u and gets the stale 5 from memory.)

 Cores may see different values for u


 With write back caches, value written back to memory depends on
happenstance of which cache flushes or writes back value when
 Threads or processes accessing main memory may see very stale value
 Unacceptable for programming, and it’s frequent!
29
Hardware Cache Coherence
Using Snooping
 Hardware guarantees that loads from all
cores will return the value of the latest
write

 Coherence mechanisms
 Metadata to track state for cached data
 Controller that snoops bus (or interconnect)
activity and reacts if needed to adjust the state
of the cache data

 There needs to be a serialization point


 Shared L3, memory controller, or memory bus

30
MSI: Simple Coherence Protocol for Write-Back Caches

Each cache line has an address tag plus state bits:
 M: Modified
 S: Shared
 I: Invalid

State transitions for a line in processor P1's cache:
 I → S: read miss (fetch the line)
 S → M: P1 signals intent to write
 S → S: reads by any processor
 S → I: another processor signals intent to write
 M → M: P1 reads or writes
 M → S: another processor reads; P1 writes the line back
 M → I: another processor signals intent to write (P1 writes back)
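
A hedged sketch of these transitions as a C state function (bus actions such as write-backs and invalidate broadcasts are noted only in comments):

typedef enum { I, S, M } msi_state_t;

typedef enum {
    CPU_READ, CPU_WRITE,          /* from this core */
    BUS_READ, BUS_WRITE_INTENT    /* snooped from other cores */
} msi_event_t;

msi_state_t msi_next(msi_state_t s, msi_event_t e) {
    switch (s) {
    case I:
        if (e == CPU_READ)  return S;   /* read miss: fetch the line */
        if (e == CPU_WRITE) return M;   /* write miss: fetch + invalidate others */
        return I;
    case S:
        if (e == CPU_WRITE)        return M;  /* broadcast intent to write */
        if (e == BUS_WRITE_INTENT) return I;  /* another core will write */
        return S;                             /* reads keep the line shared */
    case M:
        if (e == BUS_READ)         return S;  /* write back, then share */
        if (e == BUS_WRITE_INTENT) return I;  /* write back, then invalidate */
        return M;                             /* own reads/writes hit */
    }
    return I;
}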
31
Quick Questions
 How many copies of a cache line can you have
in S state?

 How many copies can you have in M state?

 How does L2 inclusion help?

33
Today’s lecture: More Caches
 Advanced cache optimizations
 H&P Chapter 5

 Cache coherence
 H&P Chapter 4

 Software-managed memories

 Beyond processor caches


34
Software-managed Memory
 Caches are complex, hard to design, hard to
optimize, hard to analyze, hard to use well,
hard to keep coherent…

 Private on-chip memory with its own address space
 Not implicitly backed by main memory
 Also called “Local Store”, “Local Memory”,
“Scratchpad”, “Stream Register File”
 Ubiquitous in embedded computing space
35
Local Stores in the wild
 IBM Cell Processor
 256KB LS per core
 Shared by inst. and data!

Playstation 3!
36
Cache vis-à-vis Local Store
 Cache  Local Store

37
Local Stores: AMAT
AMAT = HitTime + MissRate * MissPenalty

 MissRate = 0%!

Consequences?
 Simpler performance analysis
 Less motivation for out-of-order cores
 Cell processor is in-order
 High clock rate and low power
38
Local Stores: Operation
 LD/ST instructions to LS proceed normally
 No LD/ST to non-LS memory

 DMA transfers (Direct Memory Access) to


move data to/from main memory and LS
 Bulk, like memcpy()
 Asynchronous

dma(void *local_address, void *remote_address,
    int size, int tag, boolean direction);

39
Stream Programming
Time →

Serial:      get(a)  do_something(a)  get(b)  do_something(b)

Overlapped:  get(a)  get(b)
                     do_something(a)  do_something(b)

 Overlap communication with computation


 Hide memory latency
 “Macroscopic” software prefetching
 No ugly prefetch instructions interlaced w/ your code
 Doesn’t waste instruction issue bandwidth
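
A hedged sketch of this pattern with the dma() primitive from the earlier slide (dma_wait(), the tag discipline, and CHUNK are illustrative assumptions, not part of any specific API):

#include <stdbool.h>

#define CHUNK 16384                           /* illustrative transfer size */

extern void dma(void *local_address, void *remote_address,
                int size, int tag, bool direction);
extern void dma_wait(int tag);                /* hypothetical: block until transfer done */
extern void do_something(char *buf);          /* the per-chunk computation */

/* Double buffering: fetch chunk i+1 into one buffer while computing on
   chunk i in the other, overlapping communication with computation. */
void process(char *remote, int nchunks) {
    static char buf[2][CHUNK];                /* two local-store buffers */
    dma(buf[0], remote, CHUNK, 0, true);      /* prime the pipeline */
    for (int i = 0; i < nchunks; i++) {
        int cur = i & 1;
        if (i + 1 < nchunks)                  /* start the next transfer early */
            dma(buf[cur ^ 1], remote + (i + 1) * CHUNK, CHUNK, cur ^ 1, true);
        dma_wait(cur);                        /* wait for the current chunk */
        do_something(buf[cur]);               /* compute while the next DMA runs */
    }
}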
40
Local Stores: Pros and Cons
Pros:
 No coherence!
 Simple to implement
 Less overhead (no tags)
 Predictable performance, great for in-order cores
 Can potentially hide all memory latency

Cons:
 No coherence…
 Complex to program
 Can’t run existing SW
 Unpredictable access patterns perform poorly
 Pointer chasing difficult (linked lists, trees, etc.)
 People resort to implementing set-associative caches in software…
41
Local Store Efficacy

Cache optimization | Cold | Capacity | Conflict | Miss penalty | Hit time | Bandwidth | SW Complexity
Local Store        |      |          |          |              |          |           |

42
Today’s lecture: More Caches
 Advanced cache optimizations
 H&P Chapter 5

 Cache coherence
 H&P Chapter 4

 Software-managed memories

 Beyond processor caches


43
Everything is a Cache for
Something Else
Level               | Access Time      | Capacity | Managed by
Registers           | 1 cycle          | ~500B    | Software/compiler
Level 1 Cache       | 1-3 cycles       | ~64KB    | Hardware
Level 2 Cache       | 5-10 cycles      | 1-10MB   | Hardware
DRAM                | ~100 cycles      | ~10GB    | Software/OS
Disk                | 10^6-10^7 cycles | ~1TB     | Software/OS
Tape, The Interwebs |                  |          |
44
Example: File cache
 Do files exhibit locality?
 Write back or write through?
  When should we write to disk?
 Associativity?
  Place arbitrarily and keep an index
 Prefetching?
  Microsoft “SuperFetch”: loads common programs at boot
 Coherence?
  “Leases” in network filesystems
 Most disks have caches
45
Example: Browser cache
 Do web pages you visit exhibit locality?
 Write back or write through?
  No writes!
 Replacement policy?
  Probably LRU
 AMAT?
 Coherence?
  Did the page change since I last checked?
  Relaxed coherence: “If-Modified-Since” header
46
Caching is a ubiquitous tool
 Same design issues in system design as in
processor design
 Placement, lookup, write policies, replacement
policies, coherence

 Same optimization dimensions


 Size, associativity, granularity
 Hit time, miss rate, miss penalty, bandwidth,
complexity

47
Next Lecture

 DRAM (Main Memory)

48
