Digital Design & Computer Architecture (DDCA) 2024 — Lecture 24: Memory Hierarchy and Caches (before-lecture slides)
ETH Zürich
Spring 2024
24 May 2024
The Memory Hierarchy
Memory Hierarchy in a Modern System (I)
[Die photo: four cores (CORE 0-3), per-core L2 caches (L2 CACHE 0-3), a SHARED L3 CACHE, the DRAM MEMORY CONTROLLER, DRAM INTERFACE, and off-chip DRAM BANKS]
Apple M1,
2021
Source: https://www.anandtech.com/show/16252/mac-mini-apple-m1-tested 4
Memory Hierarchy in a Modern System (III)
[Figure: a lot of the system is memory — SRAM (on chip), DRAM, and storage]
7
https://download.intel.com/newsroom/kits/40thanniversary/gallery/images/Pentium_4_6xx-die.jpg
Intel Pentium 4, 2000
Memory Hierarchy in a Modern System (IV)
Core Count:
8 cores/16 threads
L1 Caches:
32 KB per core
L2 Caches:
512 KB per core
L3 Cache:
32 MB shared
Cores:
15-16 cores,
8 threads/core
L2 Caches:
2 MB per core
L3 Cache:
120 MB shared
https://www.it-techblog.de/ibm-power10-prozessor-mehr-speicher-mehr-tempo-mehr-sicherheit/09/2020/ 9
Memory Hierarchy in a Modern System (VI)
Cores:
128 Streaming Multiprocessors
L1 Cache or
Scratchpad:
192KB per SM
Can be used as L1 Cache
and/or Scratchpad
L2 Cache:
40 MB shared
11
The Problem
Ideal memory’s requirements oppose each other
Bigger is slower
Bigger → takes longer to determine the location
12
The Problem
Bigger is slower
SRAM, < 1KByte, sub-nanosec
SRAM, KByte~MByte, ~nanosec
DRAM, Gigabyte, ~50 nanosec
PCM-DIMM (Intel Optane DC DIMM), Gigabyte, ~300 nanosec
PCM-SSD (Intel Optane SSD), Gigabyte ~Terabyte, ~6-10 µs
Flash memory, Gigabyte~Terabyte, ~50-100 µs
Hard Disk, Terabyte, ~10 millisec
Faster is more expensive (monetary cost and chip area)
SRAM, < 0.3$ per Megabyte
DRAM, < 0.006$ per Megabyte
PCM-DIMM (Intel Optane DC DIMM), < 0.004$ per Megabyte
PCM-SSD, < 0.002$ per Megabyte
Flash memory, < 0.00008$ per Megabyte
Hard Disk, < 0.00003$ per Megabyte
These sample values (circa ~2023) scale with time
Other technologies have their place as well
FeRAM, MRAM, RRAM, STT-MRAM, memristors, … (not mature yet)
13
The Problem (Table View)
Bigger is slower
Memory Device | Capacity | Latency | Cost per Megabyte
SRAM | < 1 KByte | sub-nanosec |
SRAM | KByte~MByte | ~nanosec | < 0.3$
DRAM | Gigabyte | ~50 nanosec | < 0.006$
PCM-DIMM (Intel Optane DC DIMM) | Gigabyte | ~300 nanosec | < 0.004$
PCM-SSD (Intel Optane SSD) | Gigabyte~Terabyte | ~6-10 µs | < 0.002$
Flash memory | Gigabyte~Terabyte | ~50-100 µs | < 0.00008$
Hard Disk | Terabyte | ~10 millisec | < 0.00003$
Memory Device | Capacity | Latency | Cost per Megabyte | Energy per access | Energy per byte access
PCM-DIMM (Intel Optane DC DIMM) | Gigabyte | ~300 nanosec | < 0.004$ | ~80-540 pJ | ~20-135 pJ
PCM-SSD (Intel Optane SSD) | Gigabyte~Terabyte | ~6-10 µs | < 0.002$ | ~120 µJ | ~30 nJ
Flash memory | Gigabyte~Terabyte | ~50-100 µs | < 0.00008$ | ~250 µJ | ~61 nJ
Disclaimer: Take the energy values with a grain of salt as there are different assumptions
Aside: The Problem (2011 Version)
Bigger is slower
SRAM, 512 Bytes, sub-nanosec
SRAM, KByte~MByte, ~nanosec
DRAM, Gigabyte, ~50 nanosec
Hard Disk, Terabyte, ~10 millisec
17
The Memory Hierarchy
[Figure: the memory hierarchy — back up everything in large but slow memory]
18
Memory Hierarchy
Fundamental tradeoff
Fast memory: small
Large memory: slow
Idea: Memory hierarchy
[Figure: CPU with register file (RF) → Cache → Main Memory (DRAM) → Hard Disk]
19
Memory Hierarchy Example
21
Memory Locality
A “typical” program has a lot of locality in memory
references
typical programs are composed of “loops”
22
Caching Basics: Exploit Temporal Locality
Idea: Store recently accessed data in automatically-managed
fast memory (called cache)
Anticipation: same mem. location will be accessed again soon
23
Caching Basics: Exploit Spatial Locality
Idea: Store data in addresses adjacent to the recently
accessed one in automatically-managed fast memory
Logically divide memory into equal-size blocks
Fetch to cache the accessed block in its entirety
Anticipation: nearby memory locations will be accessed soon
24
The Bookshelf Analogy
Book in your hand
Desk
Bookshelf
Boxes at home
Boxes in storage
25
Caching in a Pipelined Design
The cache needs to be tightly integrated into the pipeline
Ideally, access in 1-cycle so that load-dependent operations
do not stall
High frequency pipeline → cannot make the cache large
But, we want a large cache AND a pipelined design
Idea: Cache hierarchy
Main
Level 2 Memory
CPU Level1 Cache (DRAM)
RF Cache
26
A Note on Manual vs. Automatic Management
Manual: Programmer manages data movement across levels
-- too painful for programmers on substantial programs
“core” vs “drum” memory in the 1950s
You don’t need to know how big the cache is and how it works to
write a “correct” program! (What if you want a “fast” program?)
27
Caches and Scratchpad in a Modern GPU
Cores:
128 Streaming Multiprocessors
L1 Cache or
Scratchpad:
192KB per SM
Can be used as L1 Cache
and/or Scratchpad
L2 Cache:
40 MB shared
https://developer.nvidia.com/blog/nvidia-hopper-architecture-in-depth/ 30
Cerebras’s Wafer Scale Engine (2019)
The largest ML accelerator chip
Cerebras WSE (2019): 400,000 cores, 18 GB of on-chip memory
Cerebras WSE-2 (2021): 850,000 cores, 40 GB of on-chip memory
34
A Historical Perspective
By Orion 8 - Combined from Magnetic core memory card.jpg and Magnetic core.jpg., CC BY 2.5, https://commons.wikimedia.org/w/index.php?curid=11235412
35
Automatic Management in Memory Hierarchy
Wilkes, “Slave Memories and Dynamic Storage Allocation,”
IEEE Trans. On Electronic Computers, 1965.
37
Cache in 1962 (Bloom, Cohen, Porter)
Data Store
Tag Store
38
A Modern Memory Hierarchy
Register File: 32 words, sub-nsec (manual/compiler register spilling)
L1 cache: ~10s of KB, ~nsec
L2 cache: 100s of KB ~ few MB, many nsec
L3 cache: many MBs, even more nsec
(L1-L3: automatic HW cache management; together these levels implement the memory abstraction)
hi + mi = 1 (hi: hit rate, mi: miss rate, ti: access time, Ti: perceived average access time at level i)
Thus
Ti = hi·ti + mi·(ti + Ti+1)
Ti = ti + mi·Ti+1
Keep mi low
increasing capacity Ci lowers mi, but beware of increasing ti
lower mi by smarter cache management (replacement: anticipate what you don't need; prefetching: anticipate what you will need)
Boggs et al., “The Microarchitecture of the Pentium 4 Processor,” Intel Technology Journal, 2004.
Intel Pentium 4 Example
L2 Cache
https://download.intel.com/newsroom/kits/40thanniversary/gallery/images/Pentium_4_6xx-die.jpg
43
Intel Pentium 4 Example
90nm P4, 3.6 GHz
L1 D-cache: C1 = 16 kB, t1 = 4 cyc (int) / 9 cyc (fp)
L2 D-cache: C2 = 1024 kB, t2 = 18 cyc (int) / 18 cyc (fp)
Main memory: t3 = ~50 ns or 180 cyc
Using Ti = ti + mi·Ti+1:
if m1 = 0.1,  m2 = 0.1:  T1 = 7.6,  T2 = 36
if m1 = 0.01, m2 = 0.01: T1 = 4.2,  T2 = 19.8
if m1 = 0.05, m2 = 0.01: T1 = 5.00, T2 = 19.8
if m1 = 0.01, m2 = 0.50: T1 = 5.08, T2 = 108
Notice:
best case latency is not 1
worst case access latencies are into 500+ cycles
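As a sanity check, a small C sketch (not part of the slides) that plugs the integer-load latencies above into Ti = ti + mi·Ti+1 and reproduces the four scenarios:

#include <stdio.h>

/* Average access latency of a 3-level hierarchy:
 * T3 = t3 (main memory), T2 = t2 + m2*T3, T1 = t1 + m1*T2. */
static double avg_latency(double t1, double t2, double t3,
                          double m1, double m2)
{
    double T3 = t3;
    double T2 = t2 + m2 * T3;
    double T1 = t1 + m1 * T2;
    printf("m1=%.2f m2=%.2f -> T1=%.2f T2=%.1f cycles\n", m1, m2, T1, T2);
    return T1;
}

int main(void)
{
    /* 90nm Pentium 4, integer loads: t1 = 4, t2 = 18, t3 = ~180 cycles */
    avg_latency(4, 18, 180, 0.10, 0.10);  /* T1 = 7.60, T2 = 36.0  */
    avg_latency(4, 18, 180, 0.01, 0.01);  /* T1 = 4.20, T2 = 19.8  */
    avg_latency(4, 18, 180, 0.05, 0.01);  /* T1 ~ 5.00, T2 = 19.8  */
    avg_latency(4, 18, 180, 0.01, 0.50);  /* T1 = 5.08, T2 = 108.0 */
    return 0;
}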
Cache Basics and Operation
Cache
Any structure that “memoizes” used (or produced) data
to avoid repeating the long-latency operations required to
reproduce/fetch the data from scratch
e.g., a web cache
46
Conceptual Picture of a Cache
[Figure: a cache holds blocks of data plus metadata describing each block]
48
Logical Organization of a Cache (II)
A key question: How to map chunks of the main memory
address space to blocks in the cache?
Which location in cache can a given “main memory chunk” be
placed in?
On a reference:
HIT: If in cache, use cached data instead of accessing memory
MISS: If not in cache, bring block into cache
May have to evict some other block
[Figure: address → Tag Store (hit/miss?) and Data Store (data)]
52
Blocks and Addressing the Cache
Main memory logically divided into fixed-size chunks (blocks)
Cache can house only a limited number of blocks
53
Blocks and Addressing the Cache
Main memory logically divided into fixed-size chunks (blocks)
Cache can house only a limited number of blocks
Address: tag (2 bits) | index (3 bits) | byte in block (3 bits)
1) index into the tag and data stores with index bits in address
2) check valid bit in tag store
3) compare tag bits in address with the stored tag in tag store
If the stored tag is valid and matches the tag of the block,
then the block is in the cache (cache hit)
54
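A minimal C sketch of this lookup, assuming the toy parameters above (8-byte blocks, 8 sets; the struct and function names are illustrative, not from the slides):

#include <stdint.h>
#include <stdbool.h>
#include <string.h>

#define NUM_SETS   8   /* 3 index bits         */
#define BLOCK_SIZE 8   /* 3 byte-in-block bits */

struct dm_line {
    bool     valid;
    uint32_t tag;
    uint8_t  data[BLOCK_SIZE];
};

static struct dm_line cache[NUM_SETS];

/* Returns true on a hit and copies the requested byte; on a miss the
 * caller would fetch the whole block from memory and refill the line. */
static bool dm_lookup(uint32_t addr, uint8_t *byte)
{
    uint32_t offset = addr % BLOCK_SIZE;               /* byte in block */
    uint32_t index  = (addr / BLOCK_SIZE) % NUM_SETS;  /* 1) index      */
    uint32_t tag    = addr / (BLOCK_SIZE * NUM_SETS);

    struct dm_line *line = &cache[index];
    if (line->valid && line->tag == tag) {             /* 2) and 3)     */
        *byte = line->data[offset];
        return true;                                    /* cache hit     */
    }
    return false;                                       /* cache miss    */
}

int main(void)
{
    cache[1].valid = true;              /* pretend block for addresses 8..15 */
    cache[1].tag   = 0;
    memset(cache[1].data, 0xAB, BLOCK_SIZE);

    uint8_t b;
    return dm_lookup(12, &b) ? 0 : 1;   /* address 12: set 1, tag 0 -> hit */
}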
Let’s See A Toy Example
We will examine a direct-mapped cache first
Direct-mapped: A given main memory block can be placed in
only one possible location in the cache
57
Set Associativity
Problem: Addresses N and N+8 always conflict in direct mapped cache
Idea: enable blocks with the same index to map to > 1 cache location
Example: Instead of having one column of 8, have 2 columns of 4 blocks
[Figure: tag store with two ways per SET (a valid bit and tag for each way), two comparators (=?), and a MUX selecting the hitting way from the data store]
2-way set associative cache: Blocks with the same index can map to 2 locations
Higher Associativity
[Figure: 4-way tag store with four comparators (=?) feeding hit logic; data store with a MUX selecting the byte in block; address = tag | index | byte in block]
-- More tag comparators and wider data mux; larger tag store
4-way set associative cache: Blocks with the same index can map to 4 locations
Full Associativity
Fully associative cache
A block can be placed in any cache location
[Figure: tag store with eight comparators (=?) feeding hit logic; data store with a MUX selecting the byte in block; address = tag (5 bits) | byte in block (3 bits)]
Fully associative cache: Any block can map to any location in the cache
Associativity (and Tradeoffs)
Degree of associativity: How many blocks can map to the
same index (or set)?
Higher associativity
++ Higher hit rate
-- Slower cache access time (hit latency and data access latency)
-- More expensive hardware (more comparators)
[Plot: hit rate vs. associativity]
61
Issues in Set-Associative Caches
Think of each block in a set having a “priority”
Indicating how important it is to keep the block in the cache
Key issue: How do you determine/adjust block priorities?
There are three key decisions in a set:
Insertion, promotion, eviction (replacement)
63
Implementing LRU
Idea: Evict the least recently accessed block
Problem: Need to keep track of access order of blocks
Why?
True LRU is complex
LRU is an approximation to predict locality anyway (i.e., not
the best possible cache management policy)
Examples:
Not MRU (not most recently used)
Hierarchical LRU: divide the N-way set into M “groups”, track
the MRU group and the MRU way in each group
Victim-NextVictim Replacement: Only keep track of the victim
and the next victim
65
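A sketch of one way to track exact access order for a 4-way set with per-way age counters (true LRU; real hardware typically uses the cheaper approximations listed above). The function names are illustrative:

#include <stdint.h>

#define WAYS 4

/* age[w] = 0 for the MRU way, WAYS-1 for the LRU way. */
static uint8_t age[WAYS] = {0, 1, 2, 3};

/* Called on every access that hits (or fills) way 'w'. */
static void lru_touch(int w)
{
    uint8_t old = age[w];
    for (int i = 0; i < WAYS; i++)
        if (age[i] < old)
            age[i]++;        /* ways more recent than w age by one */
    age[w] = 0;              /* w becomes most recently used       */
}

/* Called on a miss: evict the way with the largest age. */
static int lru_victim(void)
{
    for (int i = 0; i < WAYS; i++)
        if (age[i] == WAYS - 1)
            return i;
    return 0;                /* unreachable while ages stay a permutation */
}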
Cache Replacement Policy: LRU or Random
LRU vs. Random: Which one is better?
Example: 4-way cache, cyclic references to A, B, C, D, E
0% hit rate with LRU policy
Set thrashing: When the “program working set” in a set is
larger than set associativity
Random replacement policy is better when thrashing occurs
In practice:
Performance of replacement policy depends on workload
Average hit rate of LRU and Random are similar
67
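A tiny C simulation of the cyclic-reference example (one 4-way set, blocks A-E accessed round-robin) that reproduces the 0% LRU hit rate under set thrashing:

#include <stdio.h>

#define WAYS 4

int main(void)
{
    char set[WAYS] = {0};          /* cached block IDs, 0 = empty      */
    int  last_use[WAYS] = {0};     /* timestamp of last access per way */
    int  hits = 0, accesses = 0;

    for (int t = 1; t <= 5 * 20; t++) {        /* A,B,C,D,E repeated  */
        char blk = 'A' + (t - 1) % 5;
        int  way = -1, lru = 0;
        for (int w = 0; w < WAYS; w++) {
            if (set[w] == blk) way = w;                 /* hit?        */
            if (last_use[w] < last_use[lru]) lru = w;   /* find LRU    */
        }
        if (way >= 0) hits++;
        else { set[lru] = blk; way = lru; }             /* evict LRU   */
        last_use[way] = t;
        accesses++;
    }
    printf("LRU hit rate: %d/%d\n", hits, accesses);    /* prints 0/100 */
    return 0;
}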
Recommended Reading
Key observation: Some misses are more costly than others, as their latency is exposed as stall time. Reducing miss rate is not always good for performance. Cache replacement should take into account the cost of misses.
Qureshi et al., "A Case for MLP-Aware Cache Replacement," ISCA 2006.
68
What’s In A Tag Store Entry?
Valid bit
Tag
Replacement policy bits
Dirty bit?
Write back vs. write through caches
69
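For illustration, one possible layout of such an entry in a cache simulator, with assumed (not slide-given) field widths:

#include <stdint.h>

/* One tag store entry of a write-back cache (illustrative field widths). */
struct tag_entry {
    uint32_t tag   : 20;  /* block tag                                  */
    uint32_t valid : 1;   /* entry holds useful data                    */
    uint32_t dirty : 1;   /* block was modified; write back on eviction */
    uint32_t lru   : 3;   /* replacement policy bits (e.g., LRU age)    */
};

A write-through cache could drop the dirty bit, since the next level is always kept up to date.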
Handling Writes (I)
When do we write the modified data in a cache to the next level?
Write through: At the time the write happens
Write back: When the block is evicted
Write-back cache
+ Can combine multiple writes to the same block before eviction
Potentially saves bandwidth between cache levels + saves energy
-- Need a bit in the tag store indicating the block is “dirty/modified”
Write-through cache
+ Simpler design
+ All levels are up to date & consistent → simpler cache coherence: no need to check close-to-processor caches' tag stores for presence
-- More bandwidth intensive; no combining of writes
70
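A hedged sketch of the write-hit paths under the two policies (the line structure and the write_next_level interface are assumptions for illustration, not a real cache's API):

#include <stdint.h>
#include <stdbool.h>

struct line { bool valid, dirty; uint32_t tag; uint8_t data[64]; };

/* Stub standing in for the next level of the hierarchy. */
static void write_next_level(uint32_t addr, const uint8_t *bytes, int len)
{
    (void)addr; (void)bytes; (void)len;
}

/* Write-through: update the line AND forward the write immediately,
 * so the next level always stays up to date (no dirty bit needed).  */
static void write_hit_through(struct line *l, uint32_t addr, int off, uint8_t byte)
{
    l->data[off] = byte;
    write_next_level(addr, &byte, 1);
}

/* Write-back: update the line only and mark it dirty; the whole block
 * is written back once on eviction, combining multiple writes.        */
static void write_hit_back(struct line *l, int off, uint8_t byte)
{
    l->data[off] = byte;
    l->dirty = true;
}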
Handling Writes (II)
Do we allocate a cache block on a write miss?
Allocate on write miss: Yes
No-allocate on write miss: No
No-allocate
+ Conserves cache space if locality of written blocks is low
(potentially better cache hit rate)
71
Handling Writes (III)
What if the processor writes to an entire block over a small
amount of time?
Is there any need to bring the block into the cache from
memory in the first place?
72
Subblocked (Sectored) Caches
Idea: Divide a block into subblocks (or sectors)
Have separate valid and dirty bits for each subblock (sector)
Allocate only a subblock (or a subset of subblocks) on a request
Second-level caches
Decisions need to balance hit rate and access latency
Usually large and highly associative; latency not as important
Tag store and data store can be accessed serially
Previous level acts as a filter (filters some temporal & spatial locality)
Management policies are different across cache levels
76
Deeper and Larger Cache Hierarchies
Apple M1,
2021
Source: https://www.anandtech.com/show/16252/mac-mini-apple-m1-tested 77
Deeper and Larger Cache Hierarchies
Core Count:
8 cores/16 threads
L1 Caches:
32 KB per core
L2 Caches:
512 KB per core
L3 Cache:
32 MB shared
https://youtu.be/gqAYMx34euU 80
https://www.tech-critter.com/amd-keynote-computex-2021/
Deeper and Larger Cache Hierarchies
IBM POWER10,
2020
Cores:
15-16 cores,
8 threads/core
L2 Caches:
2 MB per core
L3 Cache:
120 MB shared
https://www.it-techblog.de/ibm-power10-prozessor-mehr-speicher-mehr-tempo-mehr-sicherheit/09/2020/ 81
Deeper and Larger Cache Hierarchies
Cores:
128 Streaming Multiprocessors
L1 Cache or
Scratchpad:
192KB per SM
Can be used as L1 Cache
and/or Scratchpad
L2 Cache:
40 MB shared
https://developer.nvidia.com/blog/nvidia-hopper-architecture-in-depth/ 84
NVIDIA V100 & A100 Memory Hierarchy
A100 feature: Direct copy from L2 to scratchpad, bypassing L1 and register file.
From the NVIDIA A100 whitepaper ("NVIDIA A100 Tensor Core GPU Architecture In-Depth"): "A100 improves SM bandwidth efficiency with a new load-global-store-shared asynchronous copy instruction that bypasses L1 cache and register file (RF). Additionally, A100's more efficient Tensor Cores reduce shared memory (SMEM) loads." (Figure 15: A100 SM Data Movement Efficiency)
https://images.nvidia.com/aem-dam/en-zz/Solutions/data-center/nvidia-ampere-architecture-whitepaper.pdf 85
Memory in the NVIDIA H100 GPU
[Figure: H100 SMs (each with control logic and cores), SM-to-SM direct copy, and a 60 MB L2 Cache]
Block size
Associativity
Replacement policy
Insertion/Placement policy
Promotion Policy
89
Cache Size
Cache size: total data capacity (not including tag store)
bigger cache can exploit temporal locality better
[Plot: hit rate vs. cache size]
Block Size
Too small blocks
do not exploit spatial locality well
have larger tag overhead
93
Associativity
How many blocks can be present in the same index (i.e., set)?
Larger associativity
lower miss rate (reduced conflicts)
higher hit latency and area cost
[Plot: hit rate vs. associativity]
Smaller associativity
lower cost
lower hit latency
Especially important for L1 caches
Is power of 2 associativity required?
94
Recall: Higher Associativity (4-way)
[Figure: 4-way tag store with four comparators (=?) feeding hit logic; data store with a MUX selecting the byte in block; address = tag (4 bits) | index (1 bit) | byte in block (3 bits)]
95
Higher Associativity (3-way)
[Figure: 3-way tag store with three comparators (=?) feeding hit logic; data store with a MUX selecting the byte in block; address = tag (4 bits) | index (1 bit) | byte in block (3 bits)]
96
Recall: 8-way Fully Associative Cache
[Figure: tag store with eight comparators (=?) feeding hit logic; data store with a MUX selecting the byte in block; address = tag (5 bits) | byte in block (3 bits)]
97
7-way Fully Associative Cache
[Figure: tag store with seven comparators (=?) feeding hit logic; data store with a MUX selecting the byte in block; address = tag (5 bits) | byte in block (3 bits)]
98
Classification of Cache Misses
Compulsory miss
first reference to an address (block) always results in a miss
subsequent references to the block should hit in cache unless
the block is displaced from cache for the reasons below
Capacity miss
cache is too small to hold all needed data
defined as the misses that would occur even in a fully-
associative cache (with optimal replacement) of the same
capacity
Conflict miss
defined as any miss that is neither a compulsory nor a
capacity miss
99
How to Reduce Each Miss Type
Compulsory
Caching (only accessed data) cannot help; larger blocks can
Prefetching helps: Anticipate which blocks will be needed soon
Conflict
More associativity
Other ways to get more associativity without making the
cache associative
Victim cache
Better, randomized indexing into the cache
Software hints for eviction/replacement/promotion
Capacity
Utilize cache space better: keep blocks that will be referenced
Software management: divide working set and computation
such that each “computation phase” fits in cache
100
How to Improve Cache Performance
Three fundamental goals
101
Improving Basic Cache Performance
Reducing miss rate
More associativity
Alternatives/enhancements to associativity
Victim caches, hashing, pseudo-associativity, skewed associativity
Better replacement/insertion policies
Software approaches
Reducing miss latency/cost
Multi-level caches
Critical word first
Subblocking/sectoring
Better replacement/insertion policies
Non-blocking caches (multiple cache misses in parallel)
Multiple accesses per cycle
Software approaches
102
Software Approaches for Higher Hit Rate
Restructuring data access patterns
Restructuring data layout
Loop interchange
Data structure separation/merging
Blocking
…
103
Restructuring Data Access Patterns (I)
Idea: Restructure data layout or data access patterns
Example: If column-major
x[i+1,j] follows x[i,j] in memory
x[i,j+1] is far away from x[i,j]
105
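For illustration, a sketch in C (which is row-major, so the roles of the two index orders are flipped relative to the column-major example above): the interchanged loop nest walks consecutive addresses and exploits spatial locality.

#define N 1024
static double x[N][N];

/* Poor locality in row-major C: the inner loop strides over rows, so
 * consecutive accesses x[i][j] and x[i+1][j] are N doubles apart.     */
void sum_columns_slow(double *col_sum)
{
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            col_sum[j] += x[i][j];
}

/* Loop interchange: the inner loop now walks consecutive addresses
 * x[i][j], x[i][j+1], ..., so each fetched cache block is fully used. */
void sum_columns_fast(double *col_sum)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            col_sum[j] += x[i][j];
}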
Research Opportunities
Research Opportunities
If you are interested in doing research in Computer
Architecture, Security, Systems & Bioinformatics:
Email me and Prof. Mutlu with your interest
Take the seminar course and the “Computer Architecture” course
Do readings and assignments on your own & talk with us
109
Bachelor’s Seminar in Computer Architecture
110
Research Opportunities
If you are interested in doing research in Computer
Architecture, Security, Systems & Bioinformatics:
Email me and Prof. Mutlu with your interest
Take the seminar course and the “Computer Architecture” course
Do readings and assignments on your own & talk with us
https://www.youtube.com/watch?v=mV2OuB2djEs
Digital Design & Computer Arch.
Lecture 24: Memory Hierarchy
and Caches
Frank K. Gürkaynak
Mohammad Sadrosadati
Prof. Onur Mutlu
ETH Zürich
Spring 2024
24 May 2024
Miss Latency/Cost
What is miss latency or miss cost affected by?
116
An Example
Access sequence: P4 P3 P2 P1 P1 P2 P3 P4 S1 S2 S3 (the P-block misses can be serviced in parallel; the S-block misses occur in isolation)
Belady's OPT replacement: Hit/Miss = H H H M H H H H M M M → Misses = 4, Stall periods = 4
MLP-Aware replacement: Hit/Miss = H M M M H M M M H H H → Misses = 6, Stall periods = 2 (saves stall cycles despite more misses)
Recommended: MLP-Aware Cache Replacement
How do we incorporate MLP/cost into replacement decisions?
How do we design a hybrid cache replacement policy?
119
Improving Basic Cache Performance
Reducing miss rate
More associativity
Alternatives/enhancements to associativity
Victim caches, hashing, pseudo-associativity, skewed associativity
https://www.youtube.com/watch?v=OyomXCHNJDA&list=PL5Q2soXY2Zi9OhoVQBXYFIZywZXCPl4M_&index=3 121
Lectures on Cache Optimizations (II)
https://www.youtube.com/watch?v=55oYBm9cifI&list=PL5Q2soXY2Zi9JXe3ywQMhylk_d5dI-TM7&index=6 122
Lectures on Cache Optimizations (III)
https://www.youtube.com/watch?v=jDHx2K9HxlM&list=PL5PHm2jkkXmi5CxxI7b3JCL1TWybTDtKq&index=21 123
Lectures on Cache Optimizations
Computer Architecture, Fall 2017, Lecture 3
Cache Management & Memory Parallelism (ETH, Fall 2017)
https://www.youtube.com/watch?v=OyomXCHNJDA&list=PL5Q2soXY2Zi9OhoVQBX
YFIZywZXCPl4M_&index=3
https://www.youtube.com/onurmutlulectures 124
Multi-Core Issues in Caching
126
Caches in a Multi-Core System
[Die photo: four cores (CORE 0-3), per-core L2 caches (L2 CACHE 0-3), a SHARED L3 CACHE, the DRAM MEMORY CONTROLLER, DRAM INTERFACE, and off-chip DRAM BANKS]
Caches in a Multi-Core System
Apple M1,
2021
Core Count:
8 cores/16 threads
L1 Caches:
32 KB per core
L2 Caches:
512 KB per core
L3 Cache:
32 MB shared
https://youtu.be/gqAYMx34euU 130
https://www.tech-critter.com/amd-keynote-computex-2021/
3D Stacking Technology: Example
Cores:
15-16 cores,
8 threads/core
L2 Caches:
2 MB per core
L3 Cache:
120 MB shared
https://www.it-techblog.de/ibm-power10-prozessor-mehr-speicher-mehr-tempo-mehr-sicherheit/09/2020/ 132
Caches in a Multi-Core System
Cores:
128 Streaming Multiprocessors
L1 Cache or
Scratchpad:
192KB per SM
Can be used as L1 Cache
and/or Scratchpad
L2 Cache:
40 MB shared
https://developer.nvidia.com/blog/nvidia-hopper-architecture-in-depth/ 134
Caches in Multi-Core Systems
Cache efficiency becomes even more important in a multi-
core/multi-threaded system
Memory bandwidth is at premium
Cache space is a limited resource across cores/threads
[Figure: private L2 caches per core vs. a shared L2 cache]
136
Resource Sharing Concept and Advantages
Idea: Instead of dedicating a hardware resource to a
hardware context, allow multiple contexts to use it
Example resources: functional units, pipeline, caches, buses,
memory, interconnects, storage
Why?
137
Resource Sharing Disadvantages
Resource sharing results in contention for resources
When the resource is not idle, another thread cannot use it
If space is occupied by one thread, another thread needs to re-
occupy it
[Figure: private L2 caches per core vs. a shared L2 cache]
139
Shared Caches Between Cores
Advantages:
High effective capacity
Dynamic partitioning of available cache space
No fragmentation due to static partitioning
If one core does not utilize some space, another core can
Easier to maintain coherence (a cache block is in a single location)
Disadvantages
Slower access (cache not tightly coupled with the core)
Cores incur conflict misses due to other cores’ accesses
Misses due to inter-core interference
Some cores can destroy the hit rate of other cores
Guaranteeing a minimum level of service (or fairness) to each core is harder
(how much space, how much bandwidth?)
140
Example: Problem with Shared Caches
[Figure: multiple nodes, each with per-core L1 caches ($) sharing an L2 cache ($)]
144
Lectures on Multi-Core Cache Management
https://www.youtube.com/watch?v=7_Tqlw8gxOU&list=PL5Q2soXY2Zi9OhoVQBXYFIZywZXCPl4M_&index=17 145
Lectures on Multi-Core Cache Management
https://www.youtube.com/watch?v=c9FhGRB3HoA&list=PL5Q2soXY2Zi9JXe3ywQMhylk_d5dI-TM7&index=29 146
Lectures on Multi-Core Cache Management
https://www.youtube.com/watch?v=Siz86__PD4w&list=PL5Q2soXY2Zi9JXe3ywQMhylk_d5dI-TM7&index=30 147
Lectures on Multi-Core Cache Management
Computer Architecture, Fall 2018, Lecture 18b
Multi-Core Cache Management (ETH, Fall 2018)
https://www.youtube.com/watch?v=c9FhGRB3HoA&list=PL5Q2soXY2Zi9JXe3ywQM
hylk_d5dI-TM7&index=29
https://www.youtube.com/onurmutlulectures 148
Lectures on Memory Resource Management
https://www.youtube.com/watch?v=0nnI807nCkc&list=PL5Q2soXY2Zi9xidyIgBxUz7xRPS-wisBN&index=21 149
Lectures on Memory Resource Management
Computer Architecture, Fall 2020, Lecture 11a
Memory Controllers (ETH, Fall 2020)
https://www.youtube.com/watch?v=TeG773OgiMQ&list=PL5Q2soXY2Zi9xidyIgBxUz
7xRPS-wisBN&index=20
Computer Architecture, Fall 2020, Lecture 11b
Memory Interference and QoS (ETH, Fall 2020)
https://www.youtube.com/watch?v=0nnI807nCkc&list=PL5Q2soXY2Zi9xidyIgBxUz7
xRPS-wisBN&index=21
Computer Architecture, Fall 2020, Lecture 13
Memory Interference and QoS II (ETH, Fall 2020)
https://www.youtube.com/watch?v=Axye9VqQT7w&list=PL5Q2soXY2Zi9xidyIgBxU
z7xRPS-wisBN&index=26
Computer Architecture, Fall 2020, Lecture 2a
Memory Performance Attacks (ETH, Fall 2020)
https://www.youtube.com/watch?v=VJzZbwgBfy8&list=PL5Q2soXY2Zi9xidyIgBxUz7
xRPS-wisBN&index=2
https://www.youtube.com/onurmutlulectures 150
Cache Coherence
Cache Coherence
Basic question: If multiple processors cache the same
block, how do they ensure they all see a consistent state?
[Figure: P1 and P2 connected via an interconnection network to main memory, where location x holds the value 1000]
152
The Cache Coherence Problem
[Figure sequence (slides 153-156): P1 executes ld r2, x and caches the value 1000 from main memory; P2 then executes ld r2, x and caches the same value; the following frames show how the two cached copies can become inconsistent]
Hardware Cache Coherence
Basic idea:
A processor/cache broadcasts its write/update to a memory
location to all other processors
Another processor/cache that has the location either updates
or invalidates its local copy
157
A Very Simple Coherence Scheme (VI)
Idea: All caches “snoop” (observe) each other’s write/read
operations. If a processor writes to a block, all others
invalidate the block.
A simple protocol (write-through, no-write-allocate cache):
Valid state: PrRd/--, PrWr/BusWr (stay Valid); observed BusWr → Invalid
Invalid state: PrRd/BusRd → Valid; PrWr/BusWr (stay Invalid)
Actions of the local processor on the cache block: PrRd, PrWr
Actions that are broadcast on the bus for the block: BusRd, BusWr
158
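A minimal sketch of this two-state (Valid/Invalid) protocol as a state-transition function for a write-through, no-write-allocate cache (the names and encoding are illustrative):

#include <stdio.h>

typedef enum { INVALID, VALID } coh_state_t;
typedef enum { PR_RD, PR_WR, BUS_RD, BUS_WR } coh_event_t;

/* Returns the next state of the block; *bus_msg is the event this cache
 * must broadcast, or -1 for none. Write-through, no-write-allocate: a
 * processor write always goes on the bus, and other caches holding the
 * block invalidate their copy when they observe BusWr.                  */
static coh_state_t next_state(coh_state_t s, coh_event_t e, int *bus_msg)
{
    *bus_msg = -1;
    switch (s) {
    case VALID:
        if (e == PR_RD)  return VALID;                        /* PrRd/--     */
        if (e == PR_WR)  { *bus_msg = BUS_WR; return VALID; } /* PrWr/BusWr  */
        if (e == BUS_WR) return INVALID;                      /* other writer */
        return VALID;
    case INVALID:
        if (e == PR_RD)  { *bus_msg = BUS_RD; return VALID; }   /* PrRd/BusRd */
        if (e == PR_WR)  { *bus_msg = BUS_WR; return INVALID; } /* no-write-allocate */
        return INVALID;
    }
    return s;
}

int main(void)
{
    int msg;
    coh_state_t s = INVALID;
    s = next_state(s, PR_RD, &msg);   /* -> VALID, broadcasts BusRd        */
    s = next_state(s, BUS_WR, &msg);  /* another core wrote -> INVALID     */
    printf("final state: %s\n", s == VALID ? "Valid" : "Invalid");
    return 0;
}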
Lecture on Cache Coherence
https://www.youtube.com/watch?v=T9WlyezeaII&list=PL5Q2soXY2Zi9xidyIgBxUz7xRPS-wisBN&index=38 159
Lecture on Memory Ordering & Consistency
https://www.youtube.com/watch?v=Suy09mzTbiQ&list=PL5Q2soXY2Zi9xidyIgBxUz7xRPS-wisBN&index=37 160
Lecture on Cache Coherence & Consistency
Computer Architecture, Fall 2020, Lecture 21
Cache Coherence (ETH, Fall 2020)
https://www.youtube.com/watch?v=T9WlyezeaII&list=PL5Q2soXY2Zi9xidyIgBxUz7
xRPS-wisBN&index=38
162
Two Cache Coherence Methods
How do we ensure that the proper caches are updated?
An example mechanism:
For each cache block in memory, store P+1 bits in directory
One bit for each cache, indicating whether the block is in cache
Exclusive bit: indicates that a cache has the only copy of the block
and can update it without notifying others
On a read: set the cache’s bit and arrange the supply of data
On a write: invalidate all caches that have the block and reset
their bits
Have an “exclusive bit” associated with each block in each cache
(so that the cache can update the exclusive block silently)
164
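A sketch of a directory entry and the read/write actions described above (P = 8 caches; the struct and function names are assumptions, not from the slides):

#include <stdint.h>
#include <stdbool.h>

#define NUM_CACHES 8   /* P */

/* One directory entry per memory block: P presence bits + an exclusive bit. */
struct dir_entry {
    uint8_t present[NUM_CACHES]; /* present[c] = 1 if cache c holds the block */
    bool    exclusive;           /* some cache holds the only (writable) copy */
};

/* On a read by cache c: set the cache's bit (and arrange the data supply). */
void dir_read(struct dir_entry *e, int c)
{
    e->present[c] = 1;
    e->exclusive  = false;       /* more than one copy may now exist */
}

/* On a write by cache c: invalidate all other caches that have the block
 * and reset their bits; c may then update the block silently.            */
void dir_write(struct dir_entry *e, int c)
{
    for (int i = 0; i < NUM_CACHES; i++)
        if (i != c && e->present[i]) {
            /* send an invalidation message to cache i (omitted) */
            e->present[i] = 0;
        }
    e->present[c] = 1;
    e->exclusive  = true;
}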
Directory Based Coherence Example (I)
165
Directory Based Coherence Example (I)
166
Maintaining Coherence
Need to guarantee that all processors see a consistent
value (i.e., consistent updates) for the same memory
location
On a Read:
If local copy is Invalid, put out request
168
Coherence: Update vs. Invalidate (II)
On a Write:
Read block into cache as before
Update Protocol:
Write to block, and simultaneously broadcast written
data and address to sharers
(Other nodes update the data in their caches if block is
present)
Invalidate Protocol:
Write to block, and simultaneously broadcast invalidation
of address to sharers
(Other nodes invalidate block in their caches if block is
present)
169
Update vs. Invalidate Tradeoffs
Which one is better? Update or invalidate?
Write frequency and sharing behavior are critical
Update
+ If sharer set is constant and updates are infrequent, avoids
the cost of invalidate-reacquire (broadcast update pattern)
- If data is rewritten without intervening reads by other cores,
updates would be useless
- Write-through cache policy → bus can become a bottleneck
Invalidate
+ After invalidation, core has exclusive access rights
+ Only cores that keep reading after each write retain a copy
- If write contention is high, leads to ping-ponging (rapid
invalidation-reacquire traffic from different processors)
170
Additional Slides:
Memory Interference
171
Inter-Thread/Application Interference
Problem: Threads share the memory system, but memory
system does not distinguish between threads’ requests
172
Unfair Slowdowns due to Interference
[Figure: matlab and gcc running on different cores of a multi-core chip, sharing the memory system]
Moscibroda and Mutlu, "Memory Performance Attacks: Denial of Memory Service in Multi-Core Systems," USENIX Security 2007.
173
Uncontrolled Interference: An Example
[Figure: multi-core chip running stream (core 1) and random (core 2), each with a private L2 cache, sharing the interconnect, the DRAM memory controller, and the DRAM memory system → unfairness]
174
A Memory Performance Hog
// initialize large arrays A, B (both programs)
STREAM: sequential memory access; very high row buffer locality (96% hit rate); memory intensive
RANDOM: random memory access; very low row buffer locality (3% hit rate); similarly memory intensive
175
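A hedged sketch of the two access patterns being contrasted (the exact benchmark code is not on the slide; this only reproduces the sequential vs. random difference):

#include <stdlib.h>

/* STREAM-like kernel: sequential accesses -> very high row-buffer locality. */
void stream_kernel(int *A, int *B, long n)
{
    for (long j = 0; j < n; j++)
        A[j] = B[j];
}

/* RANDOM-like kernel: random accesses -> very low row-buffer locality,
 * but similarly memory intensive.                                       */
void random_kernel(int *A, int *B, long n)
{
    for (long j = 0; j < n; j++) {
        long idx = rand() % n;
        A[idx] = B[idx];
    }
}

int main(void)
{
    long n = 1L << 24;                 /* "large arrays A, B" */
    int *A = calloc(n, sizeof *A);
    int *B = calloc(n, sizeof *B);
    stream_kernel(A, B, n);
    random_kernel(A, B, n);
    free(A); free(B);
    return 0;
}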
What Does the Memory Hog Do?
[Figure: DRAM bank with row decoder and row buffer. T0 (STREAM, the hog) keeps issuing requests to Row 0, which stays open in the row buffer (row hits), while T1's (RANDOM) requests to Rows 5, 111, and 16 wait in the memory request buffer and suffer row conflicts]
176
DRAM Controllers
A row-conflict memory access takes significantly longer
than a row-hit access
177
Effect of the Memory Performance Hog
[Bar chart: slowdowns of STREAM and RANDOM when run together on the same chip; up to 2.82X slowdown]
178
Greater Problem with More Cores
179
Greater Problem with More Cores
180
Distributed DoS in Networked Multi-Core Systems
Attackers Stock option pricing application
(Cores 1-8) (Cores 9-64)
181
More on Memory Performance Attacks
Thomas Moscibroda and Onur Mutlu,
"Memory Performance Attacks: Denial of Memory Service
in Multi-Core Systems"
Proceedings of the 16th USENIX Security Symposium (USENIX
SECURITY), pages 257-274, Boston, MA, August 2007. Slides
(ppt)
182
http://www.youtube.com/watch?v=VJzZbwgBfy8
More on Interconnect Based Starvation
Boris Grot, Stephen W. Keckler, and Onur Mutlu,
"Preemptive Virtual Clock: A Flexible, Efficient, and Cost-
effective QOS Scheme for Networks-on-Chip"
Proceedings of the 42nd International Symposium on
Microarchitecture (MICRO), pages 268-279, New York, NY,
December 2009. Slides (pdf)
183
Energy Comparison
of Memory Technologies
The Problem: Energy
Faster is more energy-efficient
SRAM, ~5 pJ
DRAM, ~40-140 pJ
PCM-DIMM (Intel Optane DC DIMM), ~80-540 pJ
PCM-SSD, ~120 µJ
Flash memory, ~250 µJ
Hard Disk, ~60 mJ
185
The Problem (Table View): Energy
Bigger is slower. Faster is more energy-efficient.
Memory Device | Capacity | Latency | Cost per Megabyte | Energy per access | Energy per byte access
PCM-DIMM (Intel Optane DC DIMM) | Gigabyte | ~300 nanosec | < 0.004$ | ~80-540 pJ | ~20-135 pJ
PCM-SSD (Intel Optane SSD) | Gigabyte~Terabyte | ~6-10 µs | < 0.002$ | ~120 µJ | ~30 nJ
Flash memory | Gigabyte~Terabyte | ~50-100 µs | < 0.00008$ | ~250 µJ | ~61 nJ
188
How is data found?
Cache organized into S sets
189
Direct Mapped Cache
[Figure: 32-bit byte addresses mapped to an 8-set direct-mapped cache (e.g., addresses 0x00...04, 0x00...24, and 0xFF...E4 all map to set 1). The cache is an 8-entry x (1+27+32)-bit SRAM: a valid bit, a 27-bit tag, and 32 data bits per entry, producing Hit and Data outputs]
191
Direct Mapped Cache Performance
Memory address 0x00...04: Tag = 00...00, Set = 001, Byte Offset = 00
[Figure (slides 192-193): sets 1, 2, and 3 are valid with tag 00...00 and hold mem[0x00...04], mem[0x00...08], and mem[0x00...0C]; all other sets are invalid]
Direct Mapped Cache: Conflict
Memory address 0x00...24: Tag = 00...01, Set = 001, Byte Offset = 00
[Figure (slides 194-195): set 1 currently holds mem[0x00...04] with tag 00...00; the tags differ, so the access misses and mem[0x00...24] replaces mem[0x00...04] in set 1; all other sets are invalid]
N-Way Set Associative Cache
[Figure: 2-way set associative cache. Address = Tag (28 bits) | Set (2 bits) | Byte Offset (2 bits). Each way stores V, a 28-bit Tag, and 32-bit Data; per-way comparators produce Hit1 and Hit0, a MUX selects the hitting way's data, and Hit = Hit1 OR Hit0]
196
N-way Set Associative Performance
# MIPS assembly code
Miss Rate =
addi $t0, $0, 5
loop: beq $t0, $0, done
lw $t1, 0x4($0)
lw $t2, 0x24($0)
addi $t0, $t0, -1
j loop
done:
Way 1 Way 0
V Tag Data V Tag Data
0 0 Set 3
0 0 Set 2
1 00...10 mem[0x00...24] 1 00...00 mem[0x00...04] Set 1
0 0 Set 0
197
N-way Set Associative Performance
# MIPS assembly code
      addi $t0, $0, 5
loop: beq  $t0, $0, done
      lw   $t1, 0x4($0)
      lw   $t2, 0x24($0)
      addi $t0, $t0, -1
      j    loop
done:
Miss Rate = 2/10 = 20%
Associativity reduces conflict misses
Way 1 Way 0
V Tag Data V Tag Data
0 0 Set 3
0 0 Set 2
1 00...10 mem[0x00...24] 1 00...00 mem[0x00...04] Set 1
0 0 Set 0
198
Fully Associative Cache
No conflict misses
Expensive to build
[Figure: fully associative cache with eight (V, Tag, Data) entries in a single set]
199
Spatial Locality?
Increase block size:
Block size, b = 4 words
C = 8 words
Direct mapped (1 block per set)
Number of blocks, B = C/b = 8/4 = 2
[Figure: address = Tag (27 bits) | Set (1 bit) | Block Offset (2 bits) | Byte Offset (2 bits). Each of the two sets holds a valid bit, a 27-bit tag, and a 4-word block; the block offset selects one of the four 32-bit words through a MUX, and the tag comparison produces Hit]
200
Direct Mapped Cache Performance
# MIPS assembly code
      addi $t0, $0, 5
loop: beq  $t0, $0, done
      lw   $t1, 0x4($0)
      lw   $t2, 0xC($0)
      lw   $t3, 0x8($0)
      addi $t0, $t0, -1
      j    loop
done:
Miss Rate =
[Figure: memory address 0x00...0C = Tag 00...00, Set 0, Block Offset 11, Byte Offset 00. Set 0 is valid with tag 00...00 and holds the block mem[0x00...00], mem[0x00...04], mem[0x00...08], mem[0x00...0C]; Set 1 is invalid]
201
Direct Mapped Cache Performance
# MIPS assembly code
      addi $t0, $0, 5
loop: beq  $t0, $0, done
      lw   $t1, 0x4($0)
      lw   $t2, 0xC($0)
      lw   $t3, 0x8($0)
      addi $t0, $t0, -1
      j    loop
done:
Miss Rate = 1/15 = 6.67%
Larger blocks reduce compulsory misses through spatial locality
[Figure: memory address 0x00...0C = Tag 00...00, Set 0, Block Offset 11, Byte Offset 00. Set 0 is valid with tag 00...00 and holds the block mem[0x00...00] through mem[0x00...0C]; Set 1 is invalid]
202
Cache Organization Recap
Main Parameters
Capacity: C
Block size: b
Number of blocks in cache: B = C/b
Number of blocks in a set: N
Number of Sets: S = B/N
Organization | Number of Ways (N) | Number of Sets (S)
Direct Mapped | 1 | B
N-Way Set Associative | 1 < N < B | B / N
Fully Associative | B | 1
203
Capacity Misses
Cache is too small to hold all data of interest at one time
If the cache is full and program tries to access data X that is
not in cache, cache must evict data Y to make room for X
Capacity miss occurs if program then tries to access Y again
X will be placed in a particular set based on its address
204
Types of Misses
Compulsory: first time data is accessed
205
LRU Replacement
# MIPS assembly
lw $t0, 0x04($0)
lw $t1, 0x24($0)
lw $t2, 0x54($0)
206
LRU Replacement
# MIPS assembly
lw $t0, 0x04($0)
lw $t1, 0x24($0)
lw $t2, 0x54($0)
Way 1 Way 0
208
Issues in Set-Associative Caches
Think of each block in a set having a “priority”
Indicating how important it is to keep the block in the cache
Key issue: How do you determine/adjust block priorities?
There are three key decisions in a set:
Insertion, promotion, eviction (replacement)
210
Implementing LRU
Idea: Evict the least recently accessed block
Problem: Need to keep track of access ordering of blocks
Why?
True LRU is complex
LRU is an approximation to predict locality anyway (i.e., not
the best possible cache management policy)
Examples:
Not MRU (not most recently used)
Hierarchical LRU: divide the N-way set into M “groups”, track
the MRU group and the MRU way in each group
Victim-NextVictim Replacement: Only keep track of the victim
and the next victim
212
Cache Replacement Policy: LRU or Random
LRU vs. Random: Which one is better?
Example: 4-way cache, cyclic references to A, B, C, D, E
0% hit rate with LRU policy
Set thrashing: When the “program working set” in a set is
larger than set associativity
Random replacement policy is better when thrashing occurs
In practice:
Performance of replacement policy depends on workload
Average hit rate of LRU and Random are similar
214
Recommended Reading
Key observation: Some misses are more costly than others, as their latency is exposed as stall time. Reducing miss rate is not always good for performance. Cache replacement should take into account the cost of misses.
Qureshi et al., "A Case for MLP-Aware Cache Replacement," ISCA 2006.
215
What’s In A Tag Store Entry?
Valid bit
Tag
Replacement policy bits
Dirty bit?
Write back vs. write through caches
216
Handling Writes (I)
When do we write the modified data in a cache to the next level?
Write through: At the time the write happens
Write back: When the block is evicted
Write-back
+ Can combine multiple writes to the same block before eviction
Potentially saves bandwidth between cache levels + saves energy
-- Need a bit in the tag store indicating the block is “dirty/modified”
Write-through
+ Simpler design
+ All levels are up to date & consistent → simpler cache coherence: no need to check close-to-processor caches' tag stores for presence
-- More bandwidth intensive; no combining of writes
217
Handling Writes (II)
Do we allocate a cache block on a write miss?
Allocate on write miss: Yes
No-allocate on write miss: No
No-allocate
+ Conserves cache space if locality of written blocks is low
(potentially better cache hit rate)
218
Handling Writes (III)
What if the processor writes to an entire block over a small
amount of time?
Is there any need to bring the block into the cache from
memory in the first place?
219
Subblocked (Sectored) Caches
Idea: Divide a block into subblocks (or sectors)
Have separate valid and dirty bits for each subblock (sector)
Allocate only a subblock (or a subset of subblocks) on a request
Apple M1,
2021
Core Count:
8 cores/16 threads
L1 Caches:
32 KB per core
L2 Caches:
512 KB per core
L3 Cache:
32 MB shared
https://youtu.be/gqAYMx34euU 226
https://www.tech-critter.com/amd-keynote-computex-2021/
Deeper and Larger Cache Hierarchies
IBM POWER10,
2020
Cores:
15-16 cores,
8 threads/core
L2 Caches:
2 MB per core
L3 Cache:
120 MB shared
https://www.it-techblog.de/ibm-power10-prozessor-mehr-speicher-mehr-tempo-mehr-sicherheit/09/2020/ 227
Deeper and Larger Cache Hierarchies
Cores:
128 Streaming Multiprocessors
L1 Cache or
Scratchpad:
192KB per SM
Can be used as L1 Cache
and/or Scratchpad
L2 Cache:
40 MB shared
https://developer.nvidia.com/blog/nvidia-hopper-architecture-in-depth/ 230
NVIDIA V100 & A100 Memory Hierarchy
A100 feature: Direct copy from L2 to scratchpad, bypassing L1 and register file.
From the NVIDIA A100 whitepaper ("NVIDIA A100 Tensor Core GPU Architecture In-Depth"): "A100 improves SM bandwidth efficiency with a new load-global-store-shared asynchronous copy instruction that bypasses L1 cache and register file (RF). Additionally, A100's more efficient Tensor Cores reduce shared memory (SMEM) loads." (Figure 15: A100 SM Data Movement Efficiency)
https://images.nvidia.com/aem-dam/en-zz/Solutions/data-center/nvidia-ampere-architecture-whitepaper.pdf 231
Memory in the NVIDIA H100 GPU
[Figure: H100 SMs (each with control logic and cores), SM-to-SM direct copy, and a 60 MB L2 Cache]
Block size
Associativity
Replacement policy
Insertion/Placement policy
Promotion Policy
235
Cache Size
Cache size: total data (not including tag) capacity
bigger can exploit temporal locality better
[Plot: hit rate vs. cache size]
Block Size
Too small blocks
do not exploit spatial locality well
have larger tag overhead
239
Associativity
How many blocks can be present in the same index (i.e., set)?
Larger associativity
lower miss rate (reduced conflicts)
higher hit latency and area cost
[Plot: hit rate vs. associativity]
Smaller associativity
lower cost
lower hit latency
Especially important for L1 caches
Is power of 2 associativity required?
240
Recall: Higher Associativity (4-way)
[Figure: 4-way tag store with four comparators (=?) feeding hit logic; data store with a MUX selecting the byte in block; address = tag (4 bits) | index (1 bit) | byte in block (3 bits)]
241
Higher Associativity (3-way)
[Figure: 3-way tag store with three comparators (=?) feeding hit logic; data store with a MUX selecting the byte in block; address = tag (4 bits) | index (1 bit) | byte in block (3 bits)]
242
Recall: 8-way Fully Associative Cache
[Figure: tag store with eight comparators (=?) feeding hit logic; data store with a MUX selecting the byte in block; address = tag (5 bits) | byte in block (3 bits)]
243
7-way Fully Associative Cache
[Figure: tag store with seven comparators (=?) feeding hit logic; data store with a MUX selecting the byte in block; address = tag (5 bits) | byte in block (3 bits)]
244
Classification of Cache Misses
Compulsory miss
first reference to an address (block) always results in a miss
subsequent references should hit unless the cache block is
displaced for the reasons below
Capacity miss
cache is too small to hold all needed data
defined as the misses that would occur even in a fully-
associative cache (with optimal replacement) of the same
capacity
Conflict miss
defined as any miss that is neither a compulsory nor a
capacity miss
245
How to Reduce Each Miss Type
Compulsory
Caching (only accessed data) cannot help; larger blocks can
Prefetching helps: Anticipate which blocks will be needed soon
Conflict
More associativity
Other ways to get more associativity without making the
cache associative
Victim cache
Better, randomized indexing into the cache
Software hints for eviction/replacement/promotion
Capacity
Utilize cache space better: keep blocks that will be referenced
Software management: divide working set and computation
such that each “computation phase” fits in cache
246
How to Improve Cache Performance
Three fundamental goals
247
Improving Basic Cache Performance
Reducing miss rate
More associativity
Alternatives/enhancements to associativity
Victim caches, hashing, pseudo-associativity, skewed associativity
Better replacement/insertion policies
Software approaches
Reducing miss latency/cost
Multi-level caches
Critical word first
Subblocking/sectoring
Better replacement/insertion policies
Non-blocking caches (multiple cache misses in parallel)
Multiple accesses per cycle
Software approaches
248
Software Approaches for Higher Hit Rate
Restructuring data access patterns
Restructuring data layout
Loop interchange
Data structure separation/merging
Blocking
…
249
Restructuring Data Access Patterns (I)
Idea: Restructure data layout or data access patterns
Example: If column-major
x[i+1,j] follows x[i,j] in memory
x[i,j+1] is far away from x[i,j]
Blocking
Divide loops operating on arrays into computation chunks so
that each chunk can hold its data in the cache
Avoids cache conflicts between different chunks of
computation
Essentially: Divide the working set so that each piece fits in
the cache
251
Data Reuse: An Example from GPU Computing
Same memory locations accessed by neighboring threads
[Figure: matrices A (M x P) and C (M x N) in a matrix multiplication, with loop indices i, j, k]
254
Naïve Matrix Multiplication (II)
Naïve implementation of matrix multiplication has poor
cache locality
#define A(i,j) matrix_A[i * P + j]
#define B(i,j) matrix_B[i * N + j]
#define C(i,j) matrix_C[i * N + j]
Consecutive accesses to B are far from each other, in different cache lines. Every access to B is likely to cause a cache miss.
[Figure: matrices A (M x P) and C (M x N) with loop indices i, j, k]
255
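A minimal C sketch of the naive loop nest implied by the slide's macros (the float element type and the function signature are assumptions, not from the slides):

/* Naive matrix multiply C = A x B with A: MxP, B: PxN, C: MxN,
 * using the row-major macros from the slide. In the k-loop, consecutive
 * accesses to B are N elements apart, i.e., in different cache lines.  */
#define A(i,j) matrix_A[(i) * P + (j)]
#define B(i,j) matrix_B[(i) * N + (j)]
#define C(i,j) matrix_C[(i) * N + (j)]

void matmul_naive(const float *matrix_A, const float *matrix_B,
                  float *matrix_C, int M, int N, int P)
{
    for (int i = 0; i < M; i++)
        for (int j = 0; j < N; j++) {
            float sum = 0.0f;
            for (int k = 0; k < P; k++)
                sum += A(i, k) * B(k, j);
            C(i, j) = sum;
        }
}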
Tiled Matrix Multiplication (I)
We can achieve better cache locality by computing on smaller tiles or blocks that fit in the cache
Or in the scratchpad memory and register file if we compute on a GPU
[Figure: A, B, and C partitioned into tile_dim x tile_dim tiles, with loop indices i, j, k]
Lam+, "The cache performance and optimizations of blocked algorithms," ASPLOS 1991. https://doi.org/10.1145/106972.106981
Bansal+, "Chapter 15 - Fast Matrix Computations on Heterogeneous Streams," in "High Performance Parallelism Pearls", 2015. https://doi.org/10.1016/B978-0-12-803819-2.00011-2
256
Kirk & Hwu, "Chapter 5 - Performance considerations," in "Programming Massively Parallel Processors (Third Edition)", 2017. https://doi.org/10.1016/B978-0-12-811986-0.00005-4
Tiled Matrix Multiplication (II)
Tiled implementation operates on submatrices (tiles or
blocks) that fit fast memories (cache, scratchpad, RF)
#define A(i,j) matrix_A[i * P + j]
#define B(i,j) matrix_B[i * N + j]
#define C(i,j) matrix_C[i * N + j]
Multiply small submatrices (tiles or blocks) of size tile_dim x tile_dim
[Figure: A and C partitioned into tile_dim x tile_dim tiles, with loop indices i, j, k]
Lam+, "The cache performance and optimizations of blocked algorithms," ASPLOS 1991. https://doi.org/10.1145/106972.106981
Bansal+, "Chapter 15 - Fast Matrix Computations on Heterogeneous Streams," in "High Performance Parallelism Pearls", 2015. https://doi.org/10.1016/B978-0-12-803819-2.00011-2
257
Kirk & Hwu, "Chapter 5 - Performance considerations," in "Programming Massively Parallel Processors (Third Edition)", 2017. https://doi.org/10.1016/B978-0-12-811986-0.00005-4
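A hedged sketch of the tiled version under the same assumptions as the naive sketch (TILE_DIM fixed at 32, M, N, P assumed to be multiples of TILE_DIM, and matrix_C assumed zero-initialized by the caller):

/* Tiled (blocked) version: operate on TILE_DIM x TILE_DIM submatrices so
 * each tile's working set fits in the cache (or GPU scratchpad/RF).
 * Assumes M, N, P are multiples of TILE_DIM and matrix_C starts at zero. */
#define A(i,j) matrix_A[(i) * P + (j)]
#define B(i,j) matrix_B[(i) * N + (j)]
#define C(i,j) matrix_C[(i) * N + (j)]
#define TILE_DIM 32

void matmul_tiled(const float *matrix_A, const float *matrix_B,
                  float *matrix_C, int M, int N, int P)
{
    for (int ii = 0; ii < M; ii += TILE_DIM)
        for (int jj = 0; jj < N; jj += TILE_DIM)
            for (int kk = 0; kk < P; kk += TILE_DIM)
                /* accumulate the contribution of one tile of A and B */
                for (int i = ii; i < ii + TILE_DIM; i++)
                    for (int j = jj; j < jj + TILE_DIM; j++) {
                        float sum = C(i, j);
                        for (int k = kk; k < kk + TILE_DIM; k++)
                            sum += A(i, k) * B(k, j);
                        C(i, j) = sum;
                    }
}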
Tiled Matrix Multiplication on GPUs
Computer Architecture - Lecture 9: GPUs and GPGPU Programming (Fall 2017) https://youtu.be/mgtlbEqn2dA?t=8157 258
Restructuring Data Layout (I)
Pointer-based traversal (e.g., of a linked list)
Assume a huge linked list (1B nodes) and unique keys
struct Node {
    struct Node* next;   // frequently accessed
    int key;             // frequently accessed
    char name[256];      // rarely accessed
    char school[256];    // rarely accessed
};
259
Restructuring Data Layout (II)
Idea: separate rarely-accessed fields of a data structure and pack them into a separate data structure
struct Node {
    struct Node* next;
    int key;
    struct Node_data* node_data;
};
struct Node_data {
    char name[256];
    char school[256];
};
Who should do this?
Programmer
Compiler (profiling vs. dynamic)
Hardware?
Who can determine what is frequently accessed?
while (node) {
    if (node->key == input_key) {
        // access node->node_data
    }
    node = node->next;
}
260
Improving Basic Cache Performance
Reducing miss rate
More associativity
Alternatives/enhancements to associativity
Victim caches, hashing, pseudo-associativity, skewed associativity
Better replacement/insertion policies
Software approaches
Reducing miss latency/cost
Multi-level caches
Critical word first
Subblocking/sectoring
Better replacement/insertion policies
Non-blocking caches (multiple cache misses in parallel)
Multiple accesses per cycle
Software approaches
261
Miss Latency/Cost
What is miss latency or miss cost affected by?
264
An Example
Access sequence: P4 P3 P2 P1 P1 P2 P3 P4 S1 S2 S3 (the P-block misses can be serviced in parallel; the S-block misses occur in isolation)
Belady's OPT replacement: Hit/Miss = H H H M H H H H M M M → Misses = 4, Stall periods = 4
MLP-Aware replacement: Hit/Miss = H M M M H M M M H H H → Misses = 6, Stall periods = 2 (saves stall cycles despite more misses)
Recommended: MLP-Aware Cache Replacement
How do we incorporate MLP/cost into replacement decisions?
How do we design a hybrid cache replacement policy?
267
Improving Basic Cache Performance
Reducing miss rate
More associativity
Alternatives/enhancements to associativity
Victim caches, hashing, pseudo-associativity, skewed associativity
https://www.youtube.com/watch?v=OyomXCHNJDA&list=PL5Q2soXY2Zi9OhoVQBXYFIZywZXCPl4M_&index=3 269
Lectures on Cache Optimizations (II)
https://www.youtube.com/watch?v=55oYBm9cifI&list=PL5Q2soXY2Zi9JXe3ywQMhylk_d5dI-TM7&index=6 270
Lectures on Cache Optimizations (III)
https://www.youtube.com/watch?v=jDHx2K9HxlM&list=PL5PHm2jkkXmi5CxxI7b3JCL1TWybTDtKq&index=21 271
Lectures on Cache Optimizations
Computer Architecture, Fall 2017, Lecture 3
Cache Management & Memory Parallelism (ETH, Fall 2017)
https://www.youtube.com/watch?v=OyomXCHNJDA&list=PL5Q2soXY2Zi9OhoVQBX
YFIZywZXCPl4M_&index=3
https://www.youtube.com/onurmutlulectures 272
Multi-Core Issues in Caching
274
Caches in a Multi-Core System
[Die photo: four cores (CORE 0-3), per-core L2 caches (L2 CACHE 0-3), a SHARED L3 CACHE, the DRAM MEMORY CONTROLLER, DRAM INTERFACE, and off-chip DRAM BANKS]
Caches in a Multi-Core System
Apple M1,
2021
Core Count:
8 cores/16 threads
L1 Caches:
32 KB per core
L2 Caches:
512 KB per core
L3 Cache:
32 MB shared
https://youtu.be/gqAYMx34euU 278
https://www.tech-critter.com/amd-keynote-computex-2021/
3D Stacking Technology: Example
Cores:
15-16 cores,
8 threads/core
L2 Caches:
2 MB per core
L3 Cache:
120 MB shared
https://www.it-techblog.de/ibm-power10-prozessor-mehr-speicher-mehr-tempo-mehr-sicherheit/09/2020/ 280
Caches in a Multi-Core System
Cores:
128 Streaming Multiprocessors
L1 Cache or
Scratchpad:
192KB per SM
Can be used as L1 Cache
and/or Scratchpad
L2 Cache:
40 MB shared
https://developer.nvidia.com/blog/nvidia-hopper-architecture-in-depth/ 282
Caches in Multi-Core Systems
Cache efficiency becomes even more important in a multi-
core/multi-threaded system
Memory bandwidth is at premium
Cache space is a limited resource across cores/threads
[Figure: private L2 caches per core vs. a shared L2 cache]
284
Resource Sharing Concept and Advantages
Idea: Instead of dedicating a hardware resource to a
hardware context, allow multiple contexts to use it
Example resources: functional units, pipeline, caches, buses,
memory
Why?
285
Resource Sharing Disadvantages
Resource sharing results in contention for resources
When the resource is not idle, another thread cannot use it
If space is occupied by one thread, another thread needs to re-
occupy it
[Figure: private L2 caches per core vs. a shared L2 cache]
287
Shared Caches Between Cores
Advantages:
High effective capacity
Dynamic partitioning of available cache space
No fragmentation due to static partitioning
If one core does not utilize some space, another core can
Easier to maintain coherence (a cache block is in a single location)
Disadvantages
Slower access (cache not tightly coupled with the core)
Cores incur conflict misses due to other cores’ accesses
Misses due to inter-core interference
Some cores can destroy the hit rate of other cores
Guaranteeing a minimum level of service (or fairness) to each core is harder
(how much space, how much bandwidth?)
288
Lectures on Multi-Core Cache Management
https://www.youtube.com/watch?v=7_Tqlw8gxOU&list=PL5Q2soXY2Zi9OhoVQBXYFIZywZXCPl4M_&index=17 289
Lectures on Multi-Core Cache Management
https://www.youtube.com/watch?v=c9FhGRB3HoA&list=PL5Q2soXY2Zi9JXe3ywQMhylk_d5dI-TM7&index=29 290
Lectures on Multi-Core Cache Management
https://www.youtube.com/watch?v=Siz86__PD4w&list=PL5Q2soXY2Zi9JXe3ywQMhylk_d5dI-TM7&index=30 291
Lectures on Multi-Core Cache Management
Computer Architecture, Fall 2018, Lecture 18b
Multi-Core Cache Management (ETH, Fall 2018)
https://www.youtube.com/watch?v=c9FhGRB3HoA&list=PL5Q2soXY2Zi9JXe3ywQM
hylk_d5dI-TM7&index=29
https://www.youtube.com/onurmutlulectures 292
Lectures on Memory Resource Management
https://www.youtube.com/watch?v=0nnI807nCkc&list=PL5Q2soXY2Zi9xidyIgBxUz7xRPS-wisBN&index=21 293
Lectures on Memory Resource Management
Computer Architecture, Fall 2020, Lecture 11a
Memory Controllers (ETH, Fall 2020)
https://www.youtube.com/watch?v=TeG773OgiMQ&list=PL5Q2soXY2Zi9xidyIgBxUz
7xRPS-wisBN&index=20
Computer Architecture, Fall 2020, Lecture 11b
Memory Interference and QoS (ETH, Fall 2020)
https://www.youtube.com/watch?v=0nnI807nCkc&list=PL5Q2soXY2Zi9xidyIgBxUz7
xRPS-wisBN&index=21
Computer Architecture, Fall 2020, Lecture 13
Memory Interference and QoS II (ETH, Fall 2020)
https://www.youtube.com/watch?v=Axye9VqQT7w&list=PL5Q2soXY2Zi9xidyIgBxU
z7xRPS-wisBN&index=26
Computer Architecture, Fall 2020, Lecture 2a
Memory Performance Attacks (ETH, Fall 2020)
https://www.youtube.com/watch?v=VJzZbwgBfy8&list=PL5Q2soXY2Zi9xidyIgBxUz7
xRPS-wisBN&index=2
https://www.youtube.com/onurmutlulectures 294
Cache Coherence
Cache Coherence
Basic question: If multiple processors cache the same
block, how do they ensure they all see a consistent state?
[Figure: P1 and P2 connected via an interconnection network to main memory, where location x holds the value 1000]
The Cache Coherence Problem
[Figure sequence: P1 executes ld r2, x and caches the value 1000 from main memory; P2 then executes ld r2, x and caches the same value; the following frames show how the two cached copies can become inconsistent]
A Very Simple Coherence Scheme (VI)
Idea: All caches “snoop” (observe) each other’s write/read
operations. If a processor writes to a block, all others
invalidate the block.
A simple protocol (write-through, no-write-allocate cache):
Valid state: PrRd/--, PrWr/BusWr (stay Valid); observed BusWr → Invalid
Invalid state: PrRd/BusRd → Valid; PrWr/BusWr (stay Invalid)
Actions of the local processor on the cache block: PrRd, PrWr
Actions that are broadcast on the bus for the block: BusRd, BusWr
301
Lecture on Cache Coherence
https://www.youtube.com/watch?v=T9WlyezeaII&list=PL5Q2soXY2Zi9xidyIgBxUz7xRPS-wisBN&index=38 302
Lecture on Memory Ordering & Consistency
https://www.youtube.com/watch?v=Suy09mzTbiQ&list=PL5Q2soXY2Zi9xidyIgBxUz7xRPS-wisBN&index=37 303
Lecture on Cache Coherence & Consistency
Computer Architecture, Fall 2020, Lecture 21
Cache Coherence (ETH, Fall 2020)
https://www.youtube.com/watch?v=T9WlyezeaII&list=PL5Q2soXY2Zi9xidyIgBxUz7
xRPS-wisBN&index=38