
Digital Design & Computer Arch.

Lecture 24: Memory Hierarchy and Caches
Frank K. Gürkaynak
Mohammad Sadrosadati
Prof. Onur Mutlu

ETH Zürich
Spring 2024
24 May 2024
The Memory Hierarchy
Memory Hierarchy in a Modern System (I)

[Die shot: CORE 0-3, per-core L2 caches (L2 CACHE 0-3), SHARED L3 CACHE,
DRAM MEMORY CONTROLLER, DRAM INTERFACE, DRAM BANKS]

AMD Barcelona, circa 2006 3


Memory Hierarchy in a Modern System (II)

Apple M1,
2021

Source: https://www.anandtech.com/show/16252/mac-mini-apple-m1-tested 4
Memory Hierarchy in a Modern System (III)

[Annotated system photo: a lot of Storage, DRAM, and SRAM (on-chip caches)]

Apple M1 Ultra System (2022)


https://www.gsmarena.com/apple_announces_m1_ultra_with_20core_cpu_and_64core_gpu-news-53481.php 5
Memory Hierarchy in an Older System

Processor chip Level 2 cache chip

Multi-chip module package

Intel Pentium Pro, 1995


By Moshen - http://en.wikipedia.org/wiki/Image:Pentiumpro_moshen.jpg, CC BY-SA 2.5, https://commons.wikimedia.org/w/index.php?curid=2262471
6
Memory Hierarchy in an Older System
L2 Cache

7
https://download.intel.com/newsroom/kits/40thanniversary/gallery/images/Pentium_4_6xx-die.jpg
Intel Pentium 4, 2000
Memory Hierarchy in a Modern System (IV)

Core Count:
8 cores/16 threads

L1 Caches:
32 KB per core

L2 Caches:
512 KB per core

L3 Cache:
32 MB shared

AMD Ryzen 5000, 2020


https://wccftech.com/amd-ryzen-5000-zen-3-vermeer-undressed-high-res-die-shots-close-ups-pictured-detailed/ 8
Memory Hierarchy in a Modern System (V)
IBM POWER10,
2020

Cores:
15-16 cores,
8 threads/core

L2 Caches:
2 MB per core

L3 Cache:
120 MB shared

https://www.it-techblog.de/ibm-power10-prozessor-mehr-speicher-mehr-tempo-mehr-sicherheit/09/2020/ 9
Memory Hierarchy in a Modern System (VI)

Cores:
128 Streaming Multiprocessors

L1 Cache or
Scratchpad:
192KB per SM
Can be used as L1 Cache
and/or Scratchpad

L2 Cache:
40 MB shared

Nvidia Ampere, 2020


https://www.tomshardware.com/news/infrared-photographer-photos-nvidia-ga102-ampere-silicon 10
Ideal Memory
 Zero access time (latency)
 Infinite capacity
 Zero cost
 Infinite bandwidth (to support multiple accesses in parallel)
 Zero energy

11
The Problem
 Ideal memory’s requirements oppose each other

 Bigger is slower
 Bigger  Takes longer to determine the location

 Faster is more expensive


 Memory technology: SRAM vs. DRAM vs. SSD vs. Disk vs. Tape

 Higher bandwidth is more expensive


 Need more banks, more ports, more channels, higher frequency
or faster technology

12
The Problem
 Bigger is slower
 SRAM, < 1KByte, sub-nanosec
 SRAM, KByte~MByte, ~nanosec
 DRAM, Gigabyte, ~50 nanosec
 PCM-DIMM (Intel Optane DC DIMM), Gigabyte, ~300 nanosec
 PCM-SSD (Intel Optane SSD), Gigabyte ~Terabyte, ~6-10 µs
 Flash memory, Gigabyte~Terabyte, ~50-100 µs
 Hard Disk, Terabyte, ~10 millisec
 Faster is more expensive (monetary cost and chip area)
 SRAM, < 0.3$ per Megabyte
 DRAM, < 0.006$ per Megabyte
 PCM-DIMM (Intel Optane DC DIMM), < 0.004$ per Megabyte
 PCM-SSD, < 0.002$ per Megabyte
 Flash memory, < 0.00008$ per Megabyte
 Hard Disk, < 0.00003$ per Megabyte
 These sample values (circa ~2023) scale with time
 Other technologies have their place as well
 FeRAM, MRAM, RRAM, STT-MRAM, memristors, … (not mature yet)
13
The Problem (Table View)
Bigger is slower
Memory Device                      Capacity            Latency        Cost per Megabyte
SRAM                               < 1 KByte           sub-nanosec
SRAM                               KByte~MByte         ~nanosec       < 0.3$
DRAM                               Gigabyte            ~50 nanosec    < 0.006$
PCM-DIMM (Intel Optane DC DIMM)    Gigabyte            ~300 nanosec   < 0.004$
PCM-SSD (Intel Optane SSD)         Gigabyte~Terabyte   ~6-10 µs       < 0.002$
Flash memory                       Gigabyte~Terabyte   ~50-100 µs     < 0.00008$
Hard Disk                          Terabyte            ~10 millisec   < 0.00003$

Faster is more expensive


($$$ and chip area)
These sample values (circa ~2023) scale with time
14
The Problem (Table View): Energy
Bigger is slower; faster is more energy-efficient per access

Memory Device                      Capacity            Latency        Cost per Megabyte   Energy per access   Energy per byte access
SRAM                               < 1 KByte           sub-nanosec                        ~5 pJ               ~1.25 pJ
SRAM                               KByte~MByte         ~nanosec       < 0.3$
DRAM                               Gigabyte            ~50 nanosec    < 0.006$            ~40-140 pJ          ~10-35 pJ
PCM-DIMM (Intel Optane DC DIMM)    Gigabyte            ~300 nanosec   < 0.004$            ~80-540 pJ          ~20-135 pJ
PCM-SSD (Intel Optane SSD)         Gigabyte~Terabyte   ~6-10 µs       < 0.002$            ~120 µJ             ~30 nJ
Flash memory                       Gigabyte~Terabyte   ~50-100 µs     < 0.00008$          ~250 µJ             ~61 nJ
Hard Disk                          Terabyte            ~10 millisec   < 0.00003$          ~60 mJ              ~15 µJ

Faster is more expensive


($$$ and chip area)
These sample values (circa ~2023) scale with time

Disclaimer: Take the energy values with a grain of salt as there are different assumptions
Aside: The Problem (2011 Version)
 Bigger is slower
 SRAM, 512 Bytes, sub-nanosec
 SRAM, KByte~MByte, ~nanosec
 DRAM, Gigabyte, ~50 nanosec
 Hard Disk, Terabyte, ~10 millisec

 Faster is more expensive (monetary cost and chip area)


 SRAM, < 10$ per Megabyte
 DRAM, < 1$ per Megabyte
 Hard Disk < 1$ per Gigabyte
 These sample values (circa ~2011) scale with time

 Other technologies have their place as well


 Flash memory (mature), PC-RAM, MRAM, RRAM (not mature yet)
16
Why Memory Hierarchy?
 We want both fast and large

 But, we cannot achieve both with a single level of memory

 Idea: Have multiple levels of storage (progressively bigger


and slower as the levels are farther from the processor)
and ensure most of the data the processor needs is kept in
the fast(er) level(s)

17
The Memory Hierarchy

move what you use here  the top of the hierarchy: fast, small, faster per byte

back up everything here  the bottom of the hierarchy: large but slow, cheaper per byte

With good locality of reference, memory appears as fast as the top level
and as large as the bottom level
18
Memory Hierarchy
 Fundamental tradeoff
 Fast memory: small
 Large memory: slow
 Idea: Memory hierarchy

CPU (RF)  Cache  Main Memory (DRAM)  Hard Disk

 Latency, cost, size, bandwidth

19
Memory Hierarchy Example

Kim & Mutlu, “Memory Systems,” Computing Handbook, 2014


https://people.inf.ethz.ch/omutlu/pub/memory-systems-introduction_computing-handbook14.pdf
20
Locality
 One’s recent past is a very good predictor of their near
future

 Temporal Locality: If you just did something, it is very


likely that you will do the same thing again soon
 since you are here today, there is a good chance you will be
here again and again regularly

 Spatial Locality: If you did something, it is very likely you


will do something similar/related (in space)
 every time I find you in this room, you are probably sitting
close to the same people AND/OR in nearby seats

21
Memory Locality
 A “typical” program has a lot of locality in memory
references
 typical programs are composed of “loops”

 Temporal: A program tends to reference the same memory


location many times and all within a small window of time

 Spatial: A program tends to reference nearby memory


locations within a window of time
 most notable examples:
1. instruction memory references  mostly sequential/streaming
2. references to arrays/vectors  often streaming/strided

22
Caching Basics: Exploit Temporal Locality
 Idea: Store recently accessed data in automatically-managed
fast memory (called cache)
 Anticipation: same mem. location will be accessed again soon

 Temporal locality principle


 Recently accessed data will be again accessed in the near future
 This is what Maurice Wilkes had in mind:
 Wilkes, “Slave Memories and Dynamic Storage Allocation,” IEEE
Trans. On Electronic Computers, 1965.
 “The use is discussed of a fast core memory of, say 32000 words
as a slave to a slower core memory of, say, one million words in
such a way that in practical cases the effective access time is
nearer that of the fast memory than that of the slow memory.”

23
Caching Basics: Exploit Spatial Locality
 Idea: Store data in addresses adjacent to the recently
accessed one in automatically-managed fast memory
 Logically divide memory into equal-size blocks
 Fetch to cache the accessed block in its entirety
 Anticipation: nearby memory locations will be accessed soon

 Spatial locality principle


 Nearby data in memory will be accessed in the near future
 E.g., sequential instruction access, array traversal
 This is what IBM 360/85 implemented
 16 Kbyte cache with 64 byte blocks
 Liptay, “Structural aspects of the System/360 Model 85 II: the
cache,” IBM Systems Journal, 1968.

24
The Bookshelf Analogy
 Book in your hand
 Desk
 Bookshelf
 Boxes at home
 Boxes in storage

 Recently-used books tend to stay on desk


 Comp Arch books, books for classes you are currently taking
 Until the desk gets full
 Adjacent books in the shelf needed around the same time
 If I have organized/categorized my books well in the shelf

25
Caching in a Pipelined Design
 The cache needs to be tightly integrated into the pipeline
 Ideally, access in 1-cycle so that load-dependent operations
do not stall
 High frequency pipeline  Cannot make the cache large
 But, we want a large cache AND a pipelined design
 Idea: Cache hierarchy

CPU (RF)  Level 1 Cache  Level 2 Cache  Main Memory (DRAM)

26
A Note on Manual vs. Automatic Management
 Manual: Programmer manages data movement across levels
-- too painful for programmers on substantial programs
 “core” vs “drum” memory in the 1950s

 done in embedded processors (on-chip scratchpad SRAM in lieu

of a cache), GPUs (called “shared memory”), ML accelerators, …

 Automatic: Hardware manages data movement across levels,


transparently to the programmer
++ programmer’s life is easier
 the average programmer doesn’t need to know about caches

 You don’t need to know how big the cache is and how it works to
write a “correct” program! (What if you want a “fast” program?)

27
Caches and Scratchpad in a Modern GPU

Cores:
128 Streaming Multiprocessors

L1 Cache or
Scratchpad:
192KB per SM
Can be used as L1 Cache
and/or Scratchpad

L2 Cache:
40 MB shared

Nvidia Ampere, 2020


https://www.tomshardware.com/news/infrared-photographer-photos-nvidia-ga102-ampere-silicon 28
Caches and Scratchpad in a Modern GPU
Nvidia Hopper, 2022

Cores: L1 Cache or L2 Cache:


144 Streaming Scratchpad: 60 MB shared
Multiprocessors 256KB per SM
Can be used as L1 Cache
and/or Scratchpad
https://wccftech.com/nvidia-hopper-gpus-featuring-mcm-technology-tape-out-soon-rumor/ 29
Caches and Scratchpad in a Modern GPU
Nvidia Hopper,
2022

Cores: L1 Cache or L2 Cache:


144 Streaming Scratchpad: 60 MB shared
Multiprocessors 256KB per SM
Can be used as L1 Cache
and/or Scratchpad

https://developer.nvidia.com/blog/nvidia-hopper-architecture-in-depth/ 30
Cerebras’s Wafer Scale Engine (2019)
 The largest ML accelerator chip

 400,000 cores

 18 GB of on-chip memory

 9 PB/s memory bandwidth

Cerebras WSE Largest GPU


1.2 Trillion transistors 21.1 Billion transistors
46,225 mm2 815 mm2
NVIDIA TITAN V
https://www.anandtech.com/show/14758/hot-chips-31-live-blogs-cerebras-wafer-scale-deep-learning
31
https://www.cerebras.net/cerebras-wafer-scale-engine-why-we-need-big-chips-for-deep-learning/
Scratchpad Memory in Cerebras WSE

84 dies on the wafer, 4539 tiles per die


 Scratchpad Memory
 Highly parallel and distributed scratchpad SRAM memory with
2D mesh interconnection fabric across tiles
 16-byte read and 8-byte write single-cycle latency
 48 KB scratchpad in each tile, totaling 18 GB on the full chip
 No shared memory
Rocki et al., “Fast stencil-code computation on a wafer-scale processor.” SC 2020. 32
Cerebras’s Wafer Scale Engine-2 (2021)
 The largest ML accelerator chip

 850,000 cores

 40 GB of on-chip memory

 20 PB/s memory bandwidth

Cerebras WSE-2 Largest GPU


2.6 Trillion transistors 54.2 Billion transistors
46,225 mm2 826 mm2
NVIDIA Ampere GA100
https://cerebras.net/product/#overview
33
A Historical Perspective

Magnetic Drum Memory Magnetic Core Memory


Main Memory of 1950s-1960s Main Memory of 1960s-1970s

Public Domain, https://commons.wikimedia.org/w/index.php?curid=239809

By Orion 8 - Combined from Magnetic core memory card.jpg and Magnetic core.jpg., CC BY 2.5, https://commons.wikimedia.org/w/index.php?curid=11235412
34
Automatic Management in Memory Hierarchy
 Wilkes, “Slave Memories and Dynamic Storage Allocation,”
IEEE Trans. On Electronic Computers, 1965.

 “By a slave memory I mean one which automatically


accumulates to itself words that come from a slower main
memory, and keeps them available for subsequent use
without it being necessary for the penalty of main memory
access to be incurred again.”
36
Historical Aside: Other Cache Papers
 Fotheringham, “Dynamic Storage Allocation in the Atlas
Computer, Including an Automatic Use of a Backing Store,”
CACM 1961.
 http://dl.acm.org/citation.cfm?id=366800

 Bloom, Cohen, Porter, “Considerations in the Design of a


Computer with High Logic-to-Memory Speed Ratio,” AIEE
Gigacycle Computing Systems Winter Meeting, Jan. 1962.

37
Cache in 1962 (Bloom, Cohen, Porter)

Data Store

Tag Store
38
A Modern Memory Hierarchy
Register File: 32 words, sub-nsec                      manual/compiler register spilling

Memory Abstraction:
  L1 cache: ~10s of KB, ~nsec
  L2 cache: 100s of KB ~ few MB, many nsec             automatic HW cache management
  L3 cache: many MBs, even more nsec
  Main memory (DRAM): many GBs, ~100 nsec
  Swap Disk: ~100 GB or few TB, ~10s of usec-msec      automatic demand paging
39
Hierarchical Latency Analysis
 A given memory hierarchy level i has intrinsic access time of ti
 It also has perceived access time Ti that is longer than ti
 Except for the outer-most hierarchy level, when looking for a given
address there is
 a chance (hit-rate hi) you “hit” and access time is ti

 a chance (miss-rate mi) you “miss” and access time ti +Ti+1

 hi + mi = 1

 Thus
Ti = hi·ti + mi·(ti + Ti+1)
Ti = ti + mi ·Ti+1

hi and mi are defined to be the hit-rate and miss-rate


of only the references that missed at Li-1
40
Hierarchy Design Considerations
 Recursive latency equation
Ti = ti + mi ·Ti+1
 The goal: achieve desired T1 within allowed cost
 Ti  ti is desirable

 Keep mi low
 increasing capacity Ci lowers mi, but beware of increasing ti
 lower mi by smarter cache management (replacement::anticipate
what you don’t need, prefetching::anticipate what you will need)

 Keep Ti+1 low


 faster outer hierarchy levels can help, but beware of increasing cost
 introduce intermediate hierarchy levels as a compromise
41
Intel Pentium 4 Example

Boggs et al., “The Microarchitecture of the Pentium 4 Processor,” Intel Technology Journal, 2004.
Intel Pentium 4 Example
L2 Cache

https://download.intel.com/newsroom/kits/40thanniversary/gallery/images/Pentium_4_6xx-die.jpg
43
Intel Pentium 4 Example
 90nm P4, 3.6 GHz                 Ti = ti + mi·Ti+1
 L1 D-cache: C1 = 16 kB, t1 = 4 cyc int / 9 cyc fp
 L2 D-cache: C2 = 1024 kB, t2 = 18 cyc int / 18 cyc fp
 Main memory: t3 = ~50 ns or 180 cyc

 Example points (using the integer latencies):
   if m1 = 0.1,  m2 = 0.1    T1 = 7.6,  T2 = 36
   if m1 = 0.01, m2 = 0.01   T1 = 4.2,  T2 = 19.8
   if m1 = 0.05, m2 = 0.01   T1 = 5.00, T2 = 19.8
   if m1 = 0.01, m2 = 0.50   T1 = 5.08, T2 = 108

 Notice
  best case latency is not 1
  worst case access latencies are into 500+ cycles
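A quick way to check these numbers is to code the recursive equation directly. This is a small C sketch, not from the slides; the function name and array layout are only illustrative.

#include <stdio.h>

/* Perceived access time of the first level: T_i = t_i + m_i * T_{i+1}.
   t[] holds intrinsic latencies from L1 outward; m[] holds miss rates for all
   but the outermost level. Computed from the outermost level (memory) inward. */
double perceived_latency(const double t[], const double m[], int levels) {
    double T = t[levels - 1];               /* outermost level: T = t */
    for (int i = levels - 2; i >= 0; i--)
        T = t[i] + m[i] * T;
    return T;
}

int main(void) {
    double t[] = {4.0, 18.0, 180.0};        /* t1, t2, t3 in cycles (integer loads) */
    double m[] = {0.10, 0.10};              /* m1, m2 */
    printf("T1 = %.2f\n", perceived_latency(t, m, 3));   /* 7.60, as on the slide */
    m[0] = 0.01; m[1] = 0.50;
    printf("T1 = %.2f\n", perceived_latency(t, m, 3));   /* 5.08 */
    return 0;
}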
Cache Basics and Operation
Cache
 Any structure that “memoizes” used (or produced) data
 to avoid repeating the long-latency operations required to
reproduce/fetch the data from scratch
 e.g., a web cache

 Most commonly in the processor design context:


an automatically-managed memory structure
 e.g., memoize in fast SRAM the most frequently or recently
accessed DRAM memory locations to avoid repeatedly paying
for the DRAM access latency

46
Conceptual Picture of a Cache

Metadata

Kim & Mutlu, “Memory Systems,” Computing Handbook, 2014


https://people.inf.ethz.ch/omutlu/pub/memory-systems-introduction_computing-handbook14.pdf
47
Logical Organization of a Cache (I)
 A key question: How to map chunks of the main memory
address space to blocks in the cache?
 Which location in cache can a given “main memory chunk” be
placed in?

48
Logical Organization of a Cache (II)
 A key question: How to map chunks of the main memory
address space to blocks in the cache?
 Which location in cache can a given “main memory chunk” be
placed in?

Kim & Mutlu, “Memory Systems,” Computing Handbook, 2014 49


Caching Basics
 Block (line): Unit of storage in the cache
 Memory is logically divided into blocks that map to potential
locations in the cache

 On a reference:
 HIT: If in cache, use cached data instead of accessing memory
 MISS: If not in cache, bring block into cache
 May have to evict some other block

 Some important cache design decisions


 Placement: where and how to place/find a block in cache?
 Replacement: what data to remove to make room in cache?
 Granularity of management: large or small blocks? Subblocks?
 Write policy: what do we do about writes?
 Instructions/data: do we treat them separately?
50
Cache Abstraction and Metrics

Address goes to both stores:
  Tag Store  (is the address in the cache? + bookkeeping)   outputs Hit/miss?
  Data Store (stores memory blocks)                         outputs Data

 Cache hit rate = (# hits) / (# hits + # misses) = (# hits) / (# accesses)


 Average memory access time (AMAT)
= ( hit-rate * hit-latency ) + ( miss-rate * miss-latency )
 Important Aside: Is reducing AMAT always beneficial for performance?
51
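As a tiny illustration (not from the slides), the two formulas above in C; the counter values are made up.

#include <stdio.h>

/* Cache hit rate and AMAT exactly as defined above. */
double amat(unsigned long hits, unsigned long misses,
            double hit_latency, double miss_latency) {
    double accesses  = (double)(hits + misses);
    double hit_rate  = hits   / accesses;
    double miss_rate = misses / accesses;
    return hit_rate * hit_latency + miss_rate * miss_latency;
}

int main(void) {
    /* hypothetical counters: 95 hits, 5 misses; 4-cycle hit, 100-cycle miss */
    printf("AMAT = %.2f cycles\n", amat(95, 5, 4.0, 100.0));   /* 8.80 */
    return 0;
}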
A Basic Hardware Cache Design
 We will start with a basic hardware cache design

 Then, we will examine a multitude of ideas to make it


better (i.e., higher performance)

52
Blocks and Addressing the Cache
 Main memory logically divided into fixed-size chunks (blocks)
 Cache can house only a limited number of blocks

53
Blocks and Addressing the Cache
 Main memory logically divided into fixed-size chunks (blocks)
 Cache can house only a limited number of blocks

 Each block address maps to a potential location in the


cache, determined by the index bits in the address
 used to index into the tag and data stores

   8-bit address:  tag (2 bits) | index (3 bits) | byte in block (3 bits)

 Cache access:

1) index into the tag and data stores with index bits in address
2) check valid bit in tag store
3) compare tag bits in address with the stored tag in tag store

 If the stored tag is valid and matches the tag of the block,
then the block is in the cache (cache hit)
54
Let’s See A Toy Example
 We will examine a direct-mapped cache first
 Direct-mapped: A given main memory block can be placed in
only one possible location in the cache

 Toy example: 256-byte memory, 64-byte cache, 8-byte blocks

Kim & Mutlu, “Memory Systems,” Computing Handbook, 2014 55


Direct-Mapped Cache: Placement and Access
Assume byte-addressable main memory: 256 bytes, 8-byte blocks  32 blocks in memory
(Block 00000 through Block 11111)
Assume cache: 64 bytes, 8 blocks
 Direct-mapped: A block can go to only one location

Address:  tag (2 bits) | index (3 bits) | byte in block (3 bits)
Tag store: one entry per cache block (valid bit V + tag); Data store: one 8-byte block per entry
Access: the index selects the entry, =? compares the stored tag, a MUX selects the byte in block  Hit? / Data

 Blocks with same index contend for the same cache location
 Cause conflict misses when accessed consecutively
56
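A minimal C sketch of the toy cache above (not from the lecture): 8-bit addresses split into a 2-bit tag, 3-bit index, and 3-bit byte-in-block offset; only the tag store and the hit/miss outcome are modeled, and all names are illustrative.

#include <stdint.h>
#include <stdio.h>
#include <stdbool.h>

/* Toy direct-mapped cache: 256-byte memory, 64-byte cache, 8-byte blocks
   -> address = tag (2 bits) | index (3 bits) | byte in block (3 bits). */
#define NUM_BLOCKS 8

typedef struct {
    bool    valid;
    uint8_t tag;
} TagEntry;

static TagEntry tag_store[NUM_BLOCKS];   /* data store omitted; only hit/miss is modeled */

bool access_cache(uint8_t addr) {
    uint8_t index = (addr >> 3) & 0x7;   /* address bits [5:3] */
    uint8_t tag   = (addr >> 6) & 0x3;   /* address bits [7:6] */

    if (tag_store[index].valid && tag_store[index].tag == tag)
        return true;                     /* hit */

    tag_store[index].valid = true;       /* miss: fill the block, evicting the old one */
    tag_store[index].tag   = tag;
    return false;
}

int main(void) {
    /* 0x08 and 0x48 share index 1 but differ in tag -> they conflict */
    uint8_t trace[] = {0x08, 0x48, 0x08, 0x48};
    for (int i = 0; i < 4; i++)
        printf("0x%02X -> %s\n", trace[i], access_cache(trace[i]) ? "hit" : "miss");
    return 0;
}

The A, B, A, B pattern in the trace produces only conflict misses, which is exactly the situation discussed on the next slide.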
Direct-Mapped Caches
 Direct-mapped cache: Two blocks in memory that map to
the same index in the cache cannot be present in the cache
at the same time
 One index  one entry

 Can lead to 0% hit rate if more than one block accessed in


an interleaved manner map to the same index
 Assume addresses A and B have the same index bits but
different tag bits
 A, B, A, B, A, B, A, B, …  conflict in the cache index
 All accesses are conflict misses

57
Set Associativity
 Problem: Addresses N and N+8 always conflict in direct mapped cache
 Idea: enable blocks with the same index to map to > 1 cache location
 Example: Instead of having one column of 8, have 2 columns of 4 blocks
Tag store: each SET holds 2 (V, tag) entries; Data store: 2 blocks per set
Access: the index selects the set, both tags are compared (=?), the hit logic picks the way (MUX),
then the byte in block is selected (MUX)  Hit? / Data
Address:  tag (3 bits) | index (2 bits) | byte in block (3 bits)

Key idea: Associative memory within the set
+ Accommodates conflicts better (fewer conflict misses)
-- More complex, slower access, larger tag store

2-way set associative cache: Blocks with the same index can map to 2 locations
Higher Associativity
 4-way: 4 (V, tag) entries and 4 comparators (=?) per set feed the hit logic; a MUX picks
the hitting way, then the byte in block  Hit? / Data
Address:  tag (4 bits) | index (1 bit) | byte in block (3 bits)

+ Likelihood of conflict misses even lower
-- More tag comparators and wider data mux; larger tag store
4-way set associative cache: Blocks with the same index can map to 4 locations
Full Associativity
 Fully associative cache
 A block can be placed in any cache location

Tag store: 8 (V, tag) entries, all compared in parallel (=?); the hit logic drives a MUX
that selects the matching block, then the byte in block  Hit? / Data
Address:  tag (5 bits) | byte in block (3 bits)   (no index bits)
Fully associative cache: Any block can map to any location in the cache
Associativity (and Tradeoffs)
 Degree of associativity: How many blocks can map to the
same index (or set)?

 Higher associativity
++ Higher hit rate
-- Slower cache access time (hit latency and data access latency)
-- More expensive hardware (more comparators)
 Diminishing returns from higher associativity

[Plot: hit rate vs. associativity, with diminishing returns]
61
Issues in Set-Associative Caches
 Think of each block in a set having a “priority”
 Indicating how important it is to keep the block in the cache
 Key issue: How do you determine/adjust block priorities?
 There are three key decisions in a set:
 Insertion, promotion, eviction (replacement)

 Insertion: What happens to priorities on a cache fill?


 Where to insert the incoming block; whether or not to insert the block
 Promotion: What happens to priorities on a cache hit?
 Whether and how to change block priority
 Eviction/replacement: What happens to priorities on a cache
miss?
 Which block to evict and how to adjust priorities
62
Eviction/Replacement Policy
 Which block in the set to replace on a cache miss?
 Any invalid block first
 If all are valid, consult the replacement policy
 Random
 FIFO
 Least recently used (how to implement?)
 Not most recently used
 Least frequently used?
 Least costly to re-fetch?
 Why would memory accesses have different cost?
 Hybrid replacement policies
 Optimal replacement policy?

63
Implementing LRU
 Idea: Evict the least recently accessed block
 Problem: Need to keep track of access order of blocks

 Question: 2-way set associative cache:


 What do you minimally need to implement LRU perfectly?

 Question: 4-way set associative cache:


 What do you minimally need to implement LRU perfectly?
 How many different access orders are possible for the 4 blocks
in the set?
 How many bits needed to encode the LRU order of a block?
 What is the logic needed to determine the LRU victim?

 Repeat for N-way set associative cache


64
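For the 2-way case, a single bit per set is enough: it names the way that was not used most recently. For the 4-way case, perfect LRU must distinguish 4! = 24 access orders, so at least 5 bits per set are needed, plus logic to find the LRU victim. The following C sketch (not from the lecture; names are illustrative) implements the 2-way case.

#include <stdint.h>
#include <stdio.h>
#include <stdbool.h>

/* 2-way set-associative cache with perfect LRU: one LRU bit per set suffices,
   naming the way that was NOT used most recently. */
#define NUM_SETS 4

typedef struct {
    bool     valid[2];
    uint32_t tag[2];
    uint8_t  lru;        /* index of the least-recently-used way (0 or 1) */
} Set;

static Set sets[NUM_SETS];

bool access_cache(uint32_t block_addr) {          /* block address = address / block size */
    Set *s = &sets[block_addr % NUM_SETS];
    uint32_t tag = block_addr / NUM_SETS;

    for (int w = 0; w < 2; w++) {
        if (s->valid[w] && s->tag[w] == tag) {    /* hit: the other way becomes LRU */
            s->lru = 1 - w;
            return true;
        }
    }
    int victim = s->lru;                          /* miss: fill the LRU way */
    s->valid[victim] = true;
    s->tag[victim]   = tag;
    s->lru           = 1 - victim;                /* the filled way is now MRU */
    return false;
}

int main(void) {
    /* blocks 0 and 4 map to the same set; with 2 ways they no longer thrash */
    uint32_t trace[] = {0, 4, 0, 4, 0, 4};
    for (int i = 0; i < 6; i++)
        printf("block %u -> %s\n", trace[i], access_cache(trace[i]) ? "hit" : "miss");
    return 0;
}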
Approximations of LRU
 Most modern processors do not implement “true LRU”
(also called “perfect LRU”) in highly-associative caches

 Why?
 True LRU is complex
 LRU is an approximation to predict locality anyway (i.e., not
the best possible cache management policy)

 Examples:
 Not MRU (not most recently used)
 Hierarchical LRU: divide the N-way set into M “groups”, track
the MRU group and the MRU way in each group
 Victim-NextVictim Replacement: Only keep track of the victim
and the next victim
65
Cache Replacement Policy: LRU or Random
 LRU vs. Random: Which one is better?
 Example: 4-way cache, cyclic references to A, B, C, D, E
 0% hit rate with LRU policy
 Set thrashing: When the “program working set” in a set is
larger than set associativity
 Random replacement policy is better when thrashing occurs
 In practice:
 Performance of replacement policy depends on workload
 Average hit rate of LRU and Random are similar

 Best of both Worlds: Hybrid of LRU and Random


 How to choose between the two? Set sampling
 See Qureshi et al., ”A Case for MLP-Aware Cache Replacement,”
ISCA 2006.
66
What Is the Optimal Replacement Policy?
 Belady’s OPT
 Replace the block that is going to be referenced furthest in the
future by the program
 Belady, “A study of replacement algorithms for a virtual-storage
computer,” IBM Systems Journal, 1966.
 How do we implement this? Simulate?

 Is this optimal for minimizing miss rate?


 Is this optimal for minimizing execution time?
 No. Cache miss latency/cost varies from block to block!
 Two reasons: Where miss is serviced from and miss overlapping
 Qureshi et al. “A Case for MLP-Aware Cache Replacement,"
ISCA 2006.

67
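Since OPT needs knowledge of the future, it is normally evaluated by simulating a recorded access trace. A minimal C sketch (illustrative trace and names) for a 4-block fully associative cache:

#include <stdio.h>

/* Belady's OPT on a recorded trace: on a miss with a full cache, evict the
   resident block whose next reference lies furthest in the future. */
#define WAYS 4

static int next_use(const int *trace, int n, int pos, int block) {
    for (int k = pos + 1; k < n; k++)
        if (trace[k] == block) return k;
    return n;                                      /* never referenced again */
}

int main(void) {
    int trace[] = {1, 2, 3, 4, 5, 1, 2, 3, 4, 5};  /* hypothetical cyclic trace */
    int n = (int)(sizeof trace / sizeof trace[0]);
    int cache[WAYS], used = 0, misses = 0;

    for (int i = 0; i < n; i++) {
        int hit = 0;
        for (int w = 0; w < used; w++)
            if (cache[w] == trace[i]) { hit = 1; break; }
        if (hit) continue;

        misses++;
        if (used < WAYS) { cache[used++] = trace[i]; continue; }

        int victim = 0, furthest = -1;             /* evict the furthest-in-future block */
        for (int w = 0; w < WAYS; w++) {
            int nu = next_use(trace, n, i, cache[w]);
            if (nu > furthest) { furthest = nu; victim = w; }
        }
        cache[victim] = trace[i];
    }
    printf("OPT misses: %d / %d accesses\n", misses, n);   /* 6 for this trace */
    return 0;
}

On this cyclic trace, LRU misses on every access (the set-thrashing case from the previous slide), while OPT misses only 6 times; as noted above, OPT still says nothing about the cost of each miss.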
Recommended Reading
 Key observation: Some misses more costly than others as their latency is
exposed as stall time. Reducing miss rate is not always good for
performance. Cache replacement should take into account cost of misses.

 Moinuddin K. Qureshi, Daniel N. Lynch, Onur Mutlu, and Yale N. Patt,


"A Case for MLP-Aware Cache Replacement"
Proceedings of the 33rd International Symposium on Computer
Architecture (ISCA), pages 167-177, Boston, MA, June 2006. Slides (ppt)

68
What’s In A Tag Store Entry?
 Valid bit
 Tag
 Replacement policy bits

 Dirty bit?
 Write back vs. write through caches

69
Handling Writes (I)
 When do we write the modified data in a cache to the next level?
 Write through: At the time the write happens
 Write back: When the block is evicted

 Write-back cache
+ Can combine multiple writes to the same block before eviction
 Potentially saves bandwidth between cache levels + saves energy
-- Need a bit in the tag store indicating the block is “dirty/modified”

 Write-through cache
+ Simpler design
+ All levels are up to date & consistent  Simpler cache coherence: no
need to check close-to-processor caches’ tag stores for presence
-- More bandwidth intensive; no combining of writes

70
Handling Writes (II)
 Do we allocate a cache block on a write miss?
 Allocate on write miss: Yes
 No-allocate on write miss: No

 Allocate on write miss


+ Can combine writes instead of writing each individually to next
level
+ Simpler because write misses can be treated the same way as
read misses
-- Requires transfer of the whole cache block

 No-allocate
+ Conserves cache space if locality of written blocks is low
(potentially better cache hit rate)
71
Handling Writes (III)
 What if the processor writes to an entire block over a small
amount of time?

 Is there any need to bring the block into the cache from
memory in the first place?

 Why do we not simply write to only a portion of the block,


i.e., subblock?
 E.g., 4 bytes out of 64 bytes
 Problem: Valid and dirty bits are associated with the entire 64
bytes, not with each individual 4 bytes

72
Subblocked (Sectored) Caches
 Idea: Divide a block into subblocks (or sectors)
 Have separate valid and dirty bits for each subblock (sector)
 Allocate only a subblock (or a subset of subblocks) on a request

++ No need to transfer the entire cache block into the cache


(A write simply validates and updates a subblock)
++ More freedom in transferring subblocks into the cache (a
cache block does not need to be in the cache fully)
(How many subblocks do you transfer on a read?)

-- More complex design; more valid and dirty bits


-- May not exploit spatial locality fully
v d subblock v d subblock v d subblock tag
73
Instruction vs. Data Caches
 Separate or Unified?

 Pros and Cons of Unified:


+ Dynamic sharing of cache space  better overall cache
utilization: no overprovisioning that might happen with static
partitioning of cache space (i.e., separate I and D caches)
-- Instructions and data can evict/thrash each other (i.e., no
guaranteed space for either)
-- I and D are accessed in different places in the pipeline. Where
do we place the unified cache for fast access?

 First level caches are almost always split


 Mainly for the last reason above – pipeline constraints
 Outer level caches are almost always unified
74
Multi-level Cache Design & Management
 Cache level greatly affects cache design & management

 First-level caches (instruction and data)


 Decisions very much affected by cycle time & pipeline structure
 Small, lower associativity; latency is critical
 Tag store and data store are usually accessed in parallel

 Second-level caches
 Decisions need to balance hit rate and access latency
 Usually large and highly associative; latency not as important
 Tag store and data store can be accessed serially

 Further-level (larger) caches


 Access energy is a larger problem due to cache sizes
 Tag store and data store are usually accessed serially
75
Serial vs. Parallel Access of Cache Levels
 Parallel: Next level cache accessed in parallel with the
previous level  a form of speculative access
+ Faster access to data if previous level misses
-- Unnecessary accesses to next level if previous level hits

 Serial: Next level cache accessed only if previous-level misses


-- Slower access to data if previous level misses
+ No wasted accesses to next level if previous level hits
 Next level does not see the same accesses as the previous

 Previous level acts as a filter (filters some temporal & spatial locality)
 Management policies are different across cache levels

76
Deeper and Larger Cache Hierarchies

Apple M1,
2021

Source: https://www.anandtech.com/show/16252/mac-mini-apple-m1-tested 77
Deeper and Larger Cache Hierarchies

Intel Alder Lake,


2021
Source: https://twitter.com/Locuza_/status/1454152714930331652 78
Deeper and Larger Cache Hierarchies

Core Count:
8 cores/16 threads

L1 Caches:
32 KB per core

L2 Caches:
512 KB per core

L3 Cache:
32 MB shared

AMD Ryzen 5000, 2020


https://wccftech.com/amd-ryzen-5000-zen-3-vermeer-undressed-high-res-die-shots-close-ups-pictured-detailed/ 79
AMD’s 3D Last Level Cache (2021)
AMD increases the L3 size of their 8-core Zen 3
processors from 32 MB to 96 MB

Additional 64 MB L3 cache die


stacked on top of the processor die
- Connected using Through Silicon Vias (TSVs)
https://community.microcenter.com/discussion/5
134/comparing-zen-3-to-zen-2
- Total of 96 MB L3 cache

https://youtu.be/gqAYMx34euU 80
https://www.tech-critter.com/amd-keynote-computex-2021/
Deeper and Larger Cache Hierarchies
IBM POWER10,
2020

Cores:
15-16 cores,
8 threads/core

L2 Caches:
2 MB per core

L3 Cache:
120 MB shared

https://www.it-techblog.de/ibm-power10-prozessor-mehr-speicher-mehr-tempo-mehr-sicherheit/09/2020/ 81
Deeper and Larger Cache Hierarchies

Cores:
128 Streaming Multiprocessors

L1 Cache or
Scratchpad:
192KB per SM
Can be used as L1 Cache
and/or Scratchpad

L2 Cache:
40 MB shared

Nvidia Ampere, 2020


https://www.tomshardware.com/news/infrared-photographer-photos-nvidia-ga102-ampere-silicon 82
Deeper and Larger Cache Hierarchies
Nvidia Hopper, 2022

Cores: L1 Cache or L2 Cache:


144 Streaming Scratchpad: 60 MB shared
Multiprocessors 256KB per SM
Can be used as L1 Cache
and/or Scratchpad
https://wccftech.com/nvidia-hopper-gpus-featuring-mcm-technology-tape-out-soon-rumor/ 83
Deeper and Larger Cache Hierarchies
Nvidia Hopper,
2022

Cores: L1 Cache or L2 Cache:


144 Streaming Scratchpad: 60 MB shared
Multiprocessors 256KB per SM
Can be used as L1 Cache
and/or Scratchpad

https://developer.nvidia.com/blog/nvidia-hopper-architecture-in-depth/ 84
NVIDIA V100 & A100 Memory Hierarchy

 Example of data movement between GPU global memory (DRAM) and GPU cores.

A100 feature: Direct copy from L2 to scratchpad, bypassing L1 and register file.

From the "NVIDIA A100 Tensor Core GPU Architecture In-Depth" whitepaper (Figure 15, A100 SM Data
Movement Efficiency): "A100 improves SM bandwidth efficiency with a new load-global-store-shared
asynchronous copy instruction that bypasses L1 cache and register file (RF). Additionally, A100's more
efficient Tensor Cores reduce shared memory (SMEM) loads." "New asynchronous barriers work together
with the asynchronous copy instruction to enable efficient data fetch pipelines, and A100 increases
maximum SMEM allocation per SM 1.7x to …"
https://images.nvidia.com/aem-dam/en-zz/Solutions/data-center/nvidia-ampere-architecture-whitepaper.pdf
85
Memory in the NVIDIA H100 GPU
[Diagram: several SMs, each with Control, Registers, Cores, a Constant Cache, and a combined
Shared Memory / L1 Cache; SM-to-SM direct copy; a shared L2 Cache; off-chip Global Memory]

  Registers:                 ≈1 cycle
  Constant Cache:            ≈5 cycles
  Shared Memory / L1 Cache:  ≈5 cycles
  L2 Cache:                  60 MB
  Global Memory:             ≈500 cycles, 80 GB, 3 TB/s
Slide credit: Izzat El Hajj 86


Multi-Level Cache Design Decisions
 Which level(s) to place a block into (from memory)?

 Which level(s) to evict a block to (from an inner level)?

 Bypassing vs. non-bypassing levels

 Inclusive, exclusive, non-inclusive hierarchies


 Inclusive: a block in an inner level is always included also in
an outer level  simplifies cache coherence
 Exclusive: a block in an inner level does not exist in an outer
level  better utilizes space in the entire hierarchy
 Non-inclusive: a block in an inner level may or may not be
included in an outer level  relaxes design decisions
87
Cache Performance
Cache Parameters vs. Miss/Hit Rate
 Cache size

 Block size

 Associativity

 Replacement policy
 Insertion/Placement policy
 Promotion Policy

89
Cache Size
 Cache size: total data capacity (not including tag store)
 bigger cache can exploit temporal locality better

 Too large a cache adversely affects hit and miss latency


 bigger is slower

 Too small a cache
 does not exploit temporal locality well
 useful data replaced often

 Working set: entire set of data the executing application references
 Within a time interval

[Plot: hit rate vs. cache size, saturating around the "working set" size]
90
Benefits of Larger Caches Vary Widely
 Benefits of cache size vary widely across applications

[Plot: misses per 1000 instructions vs. number of ways (from a 16-way 1 MB L2) for three
application classes: Low Cache Utility, High Cache Utility, Saturating Cache Utility]

Qureshi and Patt, “Utility-Based Cache Partitioning,” MICRO 2006. 91


Block Size
 Block size is the data that is associated with an address tag
 not necessarily the unit of transfer between hierarchies
 Sub-blocking: A block divided into multiple pieces (each w/ V/D bits)

 Too small blocks
 do not exploit spatial locality well
 have larger tag overhead

 Too large blocks
 too few total blocks  do not exploit temporal locality well
 waste cache space and bandwidth/energy if spatial locality is not high

[Plot: hit rate vs. block size, peaking at an intermediate block size]
92
Large Blocks: Critical-Word and Subblocking
 Large cache blocks can take a long time to fill into the cache
 Idea: Fill cache block critical-word first
 Supply the critical data to the processor immediately

 Large cache blocks can waste bus bandwidth


 Idea: Divide a block into subblocks
 Associate separate valid and dirty bits for each subblock
 Recall: When is this useful?

v d subblock v d subblock v d subblock tag

93
Associativity
 How many blocks can be present in the same index (i.e., set)?

 Larger associativity
 lower miss rate (reduced conflicts)
 higher hit latency and area cost
 Smaller associativity
 lower cost
 lower hit latency
 Especially important for L1 caches

[Plot: hit rate vs. associativity, with diminishing returns]
 Is power of 2 associativity required?
94
Recall: Higher Associativity (4-way)
 4-way
[Same 4-way tag store / data store diagram as before: 4 (V, tag) entries and 4 comparators (=?)
per set; address = tag (4 bits) | index (1 bit) | byte in block (3 bits)]

95
Higher Associativity (3-way)
 3-way
[As on the previous slide, but with 3 ways: 3 tag entries and 3 comparators (=?) per set;
address = tag (4 bits) | index (1 bit) | byte in block (3 bits)]

96
Recall: 8-way Fully Associative Cache

[Same fully associative diagram as before: 8 tag entries, 8 comparators (=?), no index bits;
address = tag (5 bits) | byte in block (3 bits)]

97
7-way Fully Associative Cache

[As above, but with 7 ways: 7 tag entries and 7 comparators (=?);
address = tag (5 bits) | byte in block (3 bits)]

98
Classification of Cache Misses
 Compulsory miss
 first reference to an address (block) always results in a miss
 subsequent references to the block should hit in cache unless
the block is displaced from cache for the reasons below

 Capacity miss
 cache is too small to hold all needed data
 defined as the misses that would occur even in a fully-
associative cache (with optimal replacement) of the same
capacity

 Conflict miss
 defined as any miss that is neither a compulsory nor a
capacity miss
99
How to Reduce Each Miss Type
 Compulsory
 Caching (only accessed data) cannot help; larger blocks can
 Prefetching helps: Anticipate which blocks will be needed soon
 Conflict
 More associativity
 Other ways to get more associativity without making the
cache associative
 Victim cache
 Better, randomized indexing into the cache
 Software hints for eviction/replacement/promotion
 Capacity
 Utilize cache space better: keep blocks that will be referenced
 Software management: divide working set and computation
such that each “computation phase” fits in cache
100
How to Improve Cache Performance
 Three fundamental goals

 Reducing miss rate


 Caveat: reducing miss rate can reduce performance if more
costly-to-refetch blocks are evicted

 Reducing miss latency or miss cost

 Reducing hit latency or hit cost

 The above three together affect performance

101
Improving Basic Cache Performance
 Reducing miss rate
 More associativity
 Alternatives/enhancements to associativity
 Victim caches, hashing, pseudo-associativity, skewed associativity
 Better replacement/insertion policies
 Software approaches
 Reducing miss latency/cost
 Multi-level caches
 Critical word first
 Subblocking/sectoring
 Better replacement/insertion policies
 Non-blocking caches (multiple cache misses in parallel)
 Multiple accesses per cycle
 Software approaches

102
Software Approaches for Higher Hit Rate
 Restructuring data access patterns
 Restructuring data layout

 Loop interchange
 Data structure separation/merging
 Blocking
 …

103
Restructuring Data Access Patterns (I)
 Idea: Restructure data layout or data access patterns
 Example: If column-major
 x[i+1,j] follows x[i,j] in memory
 x[i,j+1] is far away from x[i,j]

Poor code:
  for i = 1, rows
    for j = 1, columns
      sum = sum + x[i,j]

Better code:
  for j = 1, columns
    for i = 1, rows
      sum = sum + x[i,j]

 This is called loop interchange


 Other optimizations can also increase hit rate
 Loop fusion, array merging, …
104
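The pseudo-code above assumes column-major storage (as in Fortran or MATLAB). C arrays are row-major, so the roles of the two loop orders flip; the sketch below (array name and sizes are only illustrative) shows the same interchange idea for a row-major C array.

#include <stdio.h>

#define ROWS 1024
#define COLS 1024

static double x[ROWS][COLS];    /* C arrays are row-major: x[i][j+1] follows x[i][j] */

/* Poor order for row-major C: the inner loop strides by a whole row,
   so consecutive accesses land in different cache blocks. */
double sum_column_wise(void) {
    double sum = 0.0;
    for (int j = 0; j < COLS; j++)
        for (int i = 0; i < ROWS; i++)
            sum += x[i][j];
    return sum;
}

/* After loop interchange: the inner loop walks consecutive addresses,
   exploiting spatial locality (roughly one miss per cache block, not per element). */
double sum_row_wise(void) {
    double sum = 0.0;
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            sum += x[i][j];
    return sum;
}

int main(void) {
    /* same result, very different miss counts */
    printf("%f %f\n", sum_column_wise(), sum_row_wise());
    return 0;
}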
Improving Basic Cache Performance
 Reducing miss rate
 More associativity
 Alternatives/enhancements to associativity
 Victim caches, hashing, pseudo-associativity, skewed associativity
 Better replacement/insertion policies
 Software approaches
 Reducing miss latency/cost
 Multi-level caches
 Critical word first
 Subblocking/sectoring
 Better replacement/insertion policies
 Non-blocking caches (multiple cache misses in parallel)
 Multiple accesses per cycle
 Software approaches

105
Research Opportunities
Research Opportunities
 If you are interested in doing research in Computer
Architecture, Security, Systems & Bioinformatics:
 Email me and Prof. Mutlu with your interest
 Take the seminar course and the “Computer Architecture” course
 Do readings and assignments on your own & talk with us

 There are many exciting projects and research positions, e.g.:


 Novel memory/storage/computation/communication systems
 New execution paradigms (e.g., in-memory computing)
 Hardware security, safety, reliability, predictability
 GPUs, TPUs, FPGAs, PIM, heterogeneous systems, …
 Security-architecture-reliability-energy-performance interactions
 Architectures for genomics/proteomics/medical/health/AI/ML
 A limited list is here: https://safari.ethz.ch/theses/
107
https://people.inf.ethz.ch/omutlu/projects.htm
Bachelor’s Seminar in Computer Architecture
 Fall 2024 (offered every Fall and Spring Semester)
 2 credit units

 Rigorous seminar on fundamental and cutting-edge


topics in computer architecture
 Critical paper presentation, review, and discussion of seminal
and cutting-edge works in computer architecture
 We will cover many ideas & issues, analyze their tradeoffs,
perform critical thinking and brainstorming

 Participation, presentation, synthesis report, lots of discussion


 You can register for the course online
 https://safari.ethz.ch/architecture_seminar
108
Bachelor’s Seminar in Computer Architecture

109
Bachelor’s Seminar in Computer Architecture

110
Research Opportunities
 If you are interested in doing research in Computer
Architecture, Security, Systems & Bioinformatics:
 Email me and Prof. Mutlu with your interest
 Take the seminar course and the “Computer Architecture” course
 Do readings and assignments on your own & talk with us

 There are many exciting projects and research positions, e.g.:


 Novel memory/storage/computation/communication systems
 New execution paradigms (e.g., in-memory computing)
 Hardware security, safety, reliability, predictability
 GPUs, TPUs, FPGAs, PIM, heterogeneous systems, …
 Security-architecture-reliability-energy-performance interactions
 Architectures for genomics/proteomics/medical/health/AI/ML
 A limited list is here: https://safari.ethz.ch/theses/
111
https://people.inf.ethz.ch/omutlu/projects.htm
SAFARI Introduction & Research
Computer architecture, HW/SW, systems, bioinformatics, security, memory

https://www.youtube.com/watch?v=mV2OuB2djEs
Digital Design & Computer Arch.
Lecture 24: Memory Hierarchy
and Caches
Frank K. Gürkaynak
Mohammad Sadrosadati
Prof. Onur Mutlu

ETH Zürich
Spring 2024
24 May 2024
Miss Latency/Cost
 What is miss latency or miss cost affected by?

 Where does the miss get serviced from?


 What level of cache in the hierarchy?
 Row hit versus row conflict in DRAM (bank/rank/channel conflict)
 Queueing delays in the memory controller and the interconnect
 Local vs. remote memory (chip, node, rack, remote server, …)
 …

 How much does the miss stall the processor?


 Is it overlapped with other latencies?
 Is the data immediately needed by the processor?
 Is the incoming block going to evict a longer-to-refetch block?
 …
114
Memory Level Parallelism (MLP)

[Timeline: the miss to block A is isolated; the misses to blocks B and C overlap in time
(parallel misses)]

 Memory Level Parallelism (MLP) means generating and


servicing multiple memory accesses in parallel [Glew’98]
 Several techniques to improve MLP (e.g., out-of-order execution)
 MLP varies. Some misses are isolated and some parallel
How does this affect cache replacement?
Traditional Cache Replacement Policies
 Traditional cache replacement policies try to reduce miss
count

 Implicit assumption: Reducing miss count reduces memory-


related stall time

 Misses with varying cost/MLP breaks this assumption!

 Eliminating an isolated miss helps performance more than


eliminating a parallel miss
 Eliminating a higher-latency miss could help performance
more than eliminating a lower-latency miss

116
An Example

P4 P3 P2 P1 P1 P2 P3 P4 S1 S2 S3

Misses to blocks P1, P2, P3, P4 can be parallel


Misses to blocks S1, S2, and S3 are isolated

Two replacement algorithms:


1. Minimizes miss count (Belady’s OPT)
2. Reduces isolated miss (MLP-Aware)

For a fully associative cache containing 4 blocks


Fewest Misses = Best Performance

(The animation of cache contents over time is omitted; the reference stream, for a fully
associative cache with 4 blocks, is: P4 P3 P2 P1 P1 P2 P3 P4 S1 S2 S3)

Belady's OPT replacement:  Hit/Miss: H H H M H H H H M M M   Misses = 4, Stalls = 4
MLP-Aware replacement:     Hit/Miss: H M M M H M M M H H H   Misses = 6, Stalls = 2  saved cycles

OPT minimizes misses, but its misses include the isolated misses to S1, S2, S3, each of which
stalls on its own; MLP-aware replacement takes more (parallel, overlapping) misses but fewer stalls.
Recommended: MLP-Aware Cache Replacement
 How do we incorporate MLP/cost into replacement decisions?
 How do we design a hybrid cache replacement policy?

 Qureshi et al., “A Case for MLP-Aware Cache Replacement,”


ISCA 2006.

119
Improving Basic Cache Performance
 Reducing miss rate
 More associativity
 Alternatives/enhancements to associativity
 Victim caches, hashing, pseudo-associativity, skewed associativity

 Better replacement/insertion policies


 Software approaches
 …
 Reducing miss latency/cost
 Multi-level caches
 Critical word first
 Subblocking/sectoring
 Better replacement/insertion policies
 Non-blocking caches (multiple cache misses in parallel)
 Multiple accesses per cycle
 Software approaches
 …
120
Lectures on Cache Optimizations (I)

https://www.youtube.com/watch?v=OyomXCHNJDA&list=PL5Q2soXY2Zi9OhoVQBXYFIZywZXCPl4M_&index=3 121
Lectures on Cache Optimizations (II)

https://www.youtube.com/watch?v=55oYBm9cifI&list=PL5Q2soXY2Zi9JXe3ywQMhylk_d5dI-TM7&index=6 122
Lectures on Cache Optimizations (III)

https://www.youtube.com/watch?v=jDHx2K9HxlM&list=PL5PHm2jkkXmi5CxxI7b3JCL1TWybTDtKq&index=21 123
Lectures on Cache Optimizations
 Computer Architecture, Fall 2017, Lecture 3
 Cache Management & Memory Parallelism (ETH, Fall 2017)
 https://www.youtube.com/watch?v=OyomXCHNJDA&list=PL5Q2soXY2Zi9OhoVQBX
YFIZywZXCPl4M_&index=3

 Computer Architecture, Fall 2018, Lecture 4a


 Cache Design (ETH, Fall 2018)
 https://www.youtube.com/watch?v=55oYBm9cifI&list=PL5Q2soXY2Zi9JXe3ywQMh
ylk_d5dI-TM7&index=6

 Computer Architecture, Spring 2015, Lecture 19


 High Performance Caches (CMU, Spring 2015)
 https://www.youtube.com/watch?v=jDHx2K9HxlM&list=PL5PHm2jkkXmi5CxxI7b3J
CL1TWybTDtKq&index=21

https://www.youtube.com/onurmutlulectures 124
Multi-Core Issues in Caching
126
Caches in a Multi-Core System

[Die shot (as on slide 3): CORE 0-3, per-core L2 caches (L2 CACHE 0-3), SHARED L3 CACHE,
DRAM MEMORY CONTROLLER, DRAM INTERFACE, DRAM BANKS]
Caches in a Multi-Core System

Apple M1,
2021

Source: https://www.anandtech.com/show/16252/mac-mini-apple-m1-tested 127


Caches in a Multi-Core System

Intel Alder Lake,


2021
Source: https://twitter.com/Locuza_/status/1454152714930331652 128
Caches in a Multi-Core System

Core Count:
8 cores/16 threads

L1 Caches:
32 KB per core

L2 Caches:
512 KB per core

L3 Cache:
32 MB shared

AMD Ryzen 5000, 2020


https://wccftech.com/amd-ryzen-5000-zen-3-vermeer-undressed-high-res-die-shots-close-ups-pictured-detailed/ 129
Caches in a Multi-Core System
AMD increases the L3 size of their 8-core Zen 3
processors from 32 MB to 96 MB

Additional 64 MB L3 cache die


stacked on top of the processor die
- Connected using Through Silicon Vias (TSVs)
https://community.microcenter.com/discussion/5
134/comparing-zen-3-to-zen-2
- Total of 96 MB L3 cache

https://youtu.be/gqAYMx34euU 130
https://www.tech-critter.com/amd-keynote-computex-2021/
3D Stacking Technology: Example

https://www.pcgameshardware.de/Ryzen-7-5800X3D-CPU-278064/Specials/3D-V-Cache-Release-1393125/ 131


Caches in a Multi-Core System
IBM POWER10,
2020

Cores:
15-16 cores,
8 threads/core

L2 Caches:
2 MB per core

L3 Cache:
120 MB shared

https://www.it-techblog.de/ibm-power10-prozessor-mehr-speicher-mehr-tempo-mehr-sicherheit/09/2020/ 132
Caches in a Multi-Core System

Cores:
128 Streaming Multiprocessors

L1 Cache or
Scratchpad:
192KB per SM
Can be used as L1 Cache
and/or Scratchpad

L2 Cache:
40 MB shared

Nvidia Ampere, 2020


https://www.tomshardware.com/news/infrared-photographer-photos-nvidia-ga102-ampere-silicon 133
Caches in a Multi-Core System
Nvidia Hopper,
2022

Cores: L1 Cache or L2 Cache:


144 Streaming Scratchpad: 60 MB shared
Multiprocessors 256KB per SM
Can be used as L1 Cache
and/or Scratchpad

https://developer.nvidia.com/blog/nvidia-hopper-architecture-in-depth/ 134
Caches in Multi-Core Systems
 Cache efficiency becomes even more important in a multi-
core/multi-threaded system
 Memory bandwidth is at premium
 Cache space is a limited resource across cores/threads

 How do we design the caches in a multi-core system?

 Many decisions and questions


 Shared vs. private caches
 How to maximize performance of the entire system?
 How to provide QoS & predictable perf. to different threads in a shared cache?
 Should cache management algorithms be aware of threads?
 How should space be allocated to threads in a shared cache?
 Should we store data in compressed format in some caches?
 How do we do better reuse prediction & management in caches?
135
Private vs. Shared Caches
 Private cache: Cache belongs to one core (a shared block
can be in multiple caches)
 Shared cache: Cache is shared by multiple cores

[Left: CORE 0-3, each with its own private L2 CACHE, all connected to the DRAM MEMORY CONTROLLER.
 Right: CORE 0-3 sharing a single L2 CACHE in front of the DRAM MEMORY CONTROLLER.]

136
Resource Sharing Concept and Advantages
 Idea: Instead of dedicating a hardware resource to a
hardware context, allow multiple contexts to use it
 Example resources: functional units, pipeline, caches, buses,
memory, interconnects, storage
 Why?

+ Resource sharing improves utilization/efficiency  throughput


 When a resource is left idle by one thread, another thread can
use it; no need to replicate shared data
+ Reduces communication latency
 For example, data shared between multiple threads can be kept
in the same cache in multithreaded processors
+ Compatible with the shared memory programming model

137
Resource Sharing Disadvantages
 Resource sharing results in contention for resources
 When the resource is not idle, another thread cannot use it
 If space is occupied by one thread, another thread needs to re-
occupy it

- Sometimes reduces each or some thread’s performance


- Thread performance can be worse than when it is run alone
- Eliminates performance isolation  inconsistent performance
across runs
- Thread performance depends on co-executing threads
- Uncontrolled (free-for-all) sharing degrades quality of service
- Causes unfairness, starvation

Need to efficiently and fairly utilize shared resources


138
Private vs. Shared Caches
 Private cache: Cache belongs to one core (a shared block
can be in multiple caches)
 Shared cache: Cache is shared by multiple cores

[Left: CORE 0-3, each with its own private L2 CACHE, all connected to the DRAM MEMORY CONTROLLER.
 Right: CORE 0-3 sharing a single L2 CACHE in front of the DRAM MEMORY CONTROLLER.]

139
Shared Caches Between Cores
 Advantages:
 High effective capacity
 Dynamic partitioning of available cache space
 No fragmentation due to static partitioning
 If one core does not utilize some space, another core can
 Easier to maintain coherence (a cache block is in a single location)

 Disadvantages
 Slower access (cache not tightly coupled with the core)
 Cores incur conflict misses due to other cores’ accesses
 Misses due to inter-core interference
 Some cores can destroy the hit rate of other cores
 Guaranteeing a minimum level of service (or fairness) to each core is harder
(how much space, how much bandwidth?)

140
Example: Problem with Shared Caches

Processor Core 1 ←t1 Processor Core 2

L1 $ L1 $

L2 $

……

Kim et al., “Fair Cache Sharing and Partitioning in a Chip Multiprocessor


Architecture,” PACT 2004.
141
Example: Problem with Shared Caches

Processor Core 1 t2→ Processor Core 2

L1 $ L1 $

L2 $

……

Kim et al., “Fair Cache Sharing and Partitioning in a Chip Multiprocessor


Architecture,” PACT 2004.
142
Example: Problem with Shared Caches

Processor Core 1 ←t1 t2→ Processor Core 2

L1 $ L1 $

L2 $

……

t2’s throughput is significantly reduced due to unfair cache sharing

Kim et al., “Fair Cache Sharing and Partitioning in a Chip Multiprocessor


Architecture,” PACT 2004.
143
Resource Sharing vs. Partitioning
 Sharing improves throughput
 Better utilization of space

 Partitioning provides performance isolation (predictable


performance)
 Dedicated space

 Can we get the benefits of both?

 Idea: Design shared resources such that they are efficiently


utilized, controllable and partitionable
 No wasted resource + QoS mechanisms for threads

144
Lectures on Multi-Core Cache Management

https://www.youtube.com/watch?v=7_Tqlw8gxOU&list=PL5Q2soXY2Zi9OhoVQBXYFIZywZXCPl4M_&index=17 145
Lectures on Multi-Core Cache Management

https://www.youtube.com/watch?v=c9FhGRB3HoA&list=PL5Q2soXY2Zi9JXe3ywQMhylk_d5dI-TM7&index=29 146
Lectures on Multi-Core Cache Management

https://www.youtube.com/watch?v=Siz86__PD4w&list=PL5Q2soXY2Zi9JXe3ywQMhylk_d5dI-TM7&index=30 147
Lectures on Multi-Core Cache Management
 Computer Architecture, Fall 2018, Lecture 18b
 Multi-Core Cache Management (ETH, Fall 2018)
 https://www.youtube.com/watch?v=c9FhGRB3HoA&list=PL5Q2soXY2Zi9JXe3ywQM
hylk_d5dI-TM7&index=29

 Computer Architecture, Fall 2018, Lecture 19a


 Multi-Core Cache Management II (ETH, Fall 2018)
 https://www.youtube.com/watch?v=Siz86__PD4w&list=PL5Q2soXY2Zi9JXe3ywQM
hylk_d5dI-TM7&index=30

 Computer Architecture, Fall 2017, Lecture 15


 Multi-Core Cache Management (ETH, Fall 2017)
 https://www.youtube.com/watch?v=7_Tqlw8gxOU&list=PL5Q2soXY2Zi9OhoVQBXY
FIZywZXCPl4M_&index=17

https://www.youtube.com/onurmutlulectures 148
Lectures on Memory Resource Management

https://www.youtube.com/watch?v=0nnI807nCkc&list=PL5Q2soXY2Zi9xidyIgBxUz7xRPS-wisBN&index=21 149
Lectures on Memory Resource Management
 Computer Architecture, Fall 2020, Lecture 11a
 Memory Controllers (ETH, Fall 2020)
 https://www.youtube.com/watch?v=TeG773OgiMQ&list=PL5Q2soXY2Zi9xidyIgBxUz
7xRPS-wisBN&index=20
 Computer Architecture, Fall 2020, Lecture 11b
 Memory Interference and QoS (ETH, Fall 2020)
 https://www.youtube.com/watch?v=0nnI807nCkc&list=PL5Q2soXY2Zi9xidyIgBxUz7
xRPS-wisBN&index=21
 Computer Architecture, Fall 2020, Lecture 13
 Memory Interference and QoS II (ETH, Fall 2020)
 https://www.youtube.com/watch?v=Axye9VqQT7w&list=PL5Q2soXY2Zi9xidyIgBxU
z7xRPS-wisBN&index=26
 Computer Architecture, Fall 2020, Lecture 2a
 Memory Performance Attacks (ETH, Fall 2020)
 https://www.youtube.com/watch?v=VJzZbwgBfy8&list=PL5Q2soXY2Zi9xidyIgBxUz7
xRPS-wisBN&index=2

https://www.youtube.com/onurmutlulectures 150
Cache Coherence
Cache Coherence
 Basic question: If multiple processors cache the same
block, how do they ensure they all see a consistent state?

P1 P2

Interconnection Network

1000
x
Main Memory

152
The Cache Coherence Problem

P1 P2 ld r2, x

1000

Interconnection Network

1000
x
Main Memory

153
The Cache Coherence Problem

P1 P2 ld r2, x

ld r2, x 1000 1000

Interconnection Network

1000
x
Main Memory

154
The Cache Coherence Problem

[Figure: P1 executes “ld r2, x”, “add r1, r2, r4”, “st x, r1”, updating its cached copy to 2000; P2 still caches the stale value 1000; main memory still holds x = 1000]

155
The Cache Coherence Problem

[Figure: P2 now executes “ld r5, x” and should NOT load the stale value 1000 from its cache, since P1 has written 2000]

156
Hardware Cache Coherence
 Basic idea:
 A processor/cache broadcasts its write/update to a memory
location to all other processors
 Another processor/cache that has the location either updates
or invalidates its local copy

157
A Very Simple Coherence Scheme (VI)
 Idea: All caches “snoop” (observe) each other’s write/read
operations. If a processor writes to a block, all others
invalidate the block.
 A simple protocol:
 Write-through, no-write-allocate cache
 Actions of the local processor on the cache block: PrRd, PrWr
 Actions that are broadcast on the bus for the block: BusRd, BusWr

[State diagram: two states, Valid and Invalid. Valid: PrRd/-- and PrWr/BusWr are self-loops; an observed BusWr moves the block to Invalid. Invalid: PrRd/BusRd moves the block to Valid; PrWr/BusWr stays in Invalid.]
158
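 The protocol above as code: a minimal C sketch (not from the lecture) of the VI snooping protocol for a write-through, no-write-allocate cache. All type, function, and variable names here are our own.

#include <stdio.h>

typedef enum { INVALID, VALID } vi_state_t;
typedef enum { PR_RD, PR_WR } proc_op_t;    /* local processor actions      */
typedef enum { BUS_RD, BUS_WR } bus_op_t;   /* actions broadcast on the bus */

/* Stub for the bus broadcast; in hardware this is a bus transaction. */
static void bus_send(bus_op_t op, unsigned addr) {
    printf("bus: %s addr=0x%x\n", op == BUS_RD ? "BusRd" : "BusWr", addr);
}

/* Local processor access (write-through, no-write-allocate cache). */
static vi_state_t vi_on_proc_access(vi_state_t s, proc_op_t op, unsigned addr) {
    if (op == PR_RD) {
        if (s == INVALID) bus_send(BUS_RD, addr);  /* PrRd/BusRd: fetch the block */
        return VALID;                              /* PrRd/-- if already Valid    */
    }
    bus_send(BUS_WR, addr);   /* PrWr/BusWr: write through; no allocation,   */
    return s;                 /* so the local state does not change          */
}

/* Snooping another processor's bus action for this block. */
static vi_state_t vi_on_snoop(vi_state_t s, bus_op_t op) {
    return (op == BUS_WR) ? INVALID : s;   /* an observed BusWr invalidates */
}

int main(void) {
    vi_state_t p2 = INVALID;
    p2 = vi_on_proc_access(p2, PR_RD, 0x100);  /* P2 reads x: Invalid -> Valid */
    p2 = vi_on_snoop(p2, BUS_WR);              /* P1 writes x: P2 invalidates  */
    printf("P2 state: %s\n", p2 == VALID ? "Valid" : "Invalid");
    return 0;
}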
Lecture on Cache Coherence

https://www.youtube.com/watch?v=T9WlyezeaII&list=PL5Q2soXY2Zi9xidyIgBxUz7xRPS-wisBN&index=38 159
Lecture on Memory Ordering & Consistency

https://www.youtube.com/watch?v=Suy09mzTbiQ&list=PL5Q2soXY2Zi9xidyIgBxUz7xRPS-wisBN&index=37 160
Lecture on Cache Coherence & Consistency
 Computer Architecture, Fall 2020, Lecture 21
 Cache Coherence (ETH, Fall 2020)
 https://www.youtube.com/watch?v=T9WlyezeaII&list=PL5Q2soXY2Zi9xidyIgBxUz7
xRPS-wisBN&index=38

 Computer Architecture, Fall 2020, Lecture 20


 Memory Ordering & Consistency (ETH, Fall 2020)
 https://www.youtube.com/watch?v=Suy09mzTbiQ&list=PL5Q2soXY2Zi9xidyIgBxUz
7xRPS-wisBN&index=37

 Computer Architecture, Spring 2015, Lecture 28


 Memory Consistency & Cache Coherence (CMU, Spring 2015)
 https://www.youtube.com/watch?v=JfjT1a0vi4E&list=PL5PHm2jkkXmi5CxxI7b3JCL
1TWybTDtKq&index=32

 Computer Architecture, Spring 2015, Lecture 29


 Cache Coherence (CMU, Spring 2015)
 https://www.youtube.com/watch?v=X6DZchnMYcw&list=PL5PHm2jkkXmi5CxxI7b3
JCL1TWybTDtKq&index=33
https://www.youtube.com/onurmutlulectures 161
Additional Slides:
Cache Coherence

162
Two Cache Coherence Methods
 How do we ensure that the proper caches are updated?

 Snoopy Bus [Goodman ISCA 1983, Papamarcos+ ISCA 1984]


 Bus-based, single point of serialization for all memory requests
 Processors observe other processors’ actions
 E.g.: P1 makes “read-exclusive” request for A on bus, P0 sees this
and invalidates its own copy of A

 Directory [Censier and Feautrier, IEEE ToC 1978]


 Single point of serialization per block, distributed among nodes
 Processors make explicit requests for blocks
 Directory tracks which caches have each block
 Directory coordinates invalidations and updates
 E.g.: P1 asks directory for exclusive copy, directory asks P0 to
invalidate, waits for ACK, then responds to P1
163
Directory Based Coherence
 Idea: A logically-central directory keeps track of where the
copies of each cache block reside. Caches consult this
directory to ensure coherence.

 An example mechanism:
 For each cache block in memory, store P+1 bits in directory
 One bit for each cache, indicating whether the block is in cache
 Exclusive bit: indicates that a cache has the only copy of the block
and can update it without notifying others
 On a read: set the cache’s bit and arrange the supply of data
 On a write: invalidate all caches that have the block and reset
their bits
 Have an “exclusive bit” associated with each block in each cache
(so that the cache can update the exclusive block silently)
164
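 A minimal C sketch (our own, purely illustrative) of the directory mechanism described above: P presence bits plus an exclusive bit per block; a read sets the requester's bit, a write invalidates all other sharers and resets their bits. NUM_CACHES and the helper names are assumptions, not part of the slide.

#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

#define NUM_CACHES 8   /* P caches; 8 is chosen only for this sketch */

/* One directory entry per memory block: P presence bits + 1 exclusive bit. */
typedef struct {
    uint8_t presence;    /* bit i set: cache i may hold the block           */
    bool    exclusive;   /* one cache holds the only copy and may update it */
} dir_entry_t;

/* Stub: in a real system this sends an invalidation message to cache c. */
static void send_invalidate(int c, unsigned block) {
    printf("invalidate block %u in cache %d\n", block, c);
}

/* Read request from cache c: record the sharer; data is supplied to c. */
static void dir_on_read(dir_entry_t *e, int c, unsigned block) {
    (void)block;
    e->presence |= (uint8_t)(1u << c);
    e->exclusive = false;                /* there may now be multiple sharers */
}

/* Write request from cache c: invalidate all other sharers first. */
static void dir_on_write(dir_entry_t *e, int c, unsigned block) {
    for (int i = 0; i < NUM_CACHES; i++)
        if (i != c && (e->presence & (1u << i)))
            send_invalidate(i, block);
    e->presence = (uint8_t)(1u << c);    /* only the writer keeps a copy */
    e->exclusive = true;                 /* it can now update silently   */
}

int main(void) {
    dir_entry_t e = { 0, false };
    dir_on_read(&e, 0, 42);    /* P0 reads block 42                          */
    dir_on_read(&e, 1, 42);    /* P1 reads block 42                          */
    dir_on_write(&e, 1, 42);   /* P1 writes: directory asks P0 to invalidate */
    printf("presence=0x%x exclusive=%d\n", e.presence, e.exclusive);
    return 0;
}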
Directory Based Coherence Example (I)

165
Directory Based Coherence Example (I)

166
Maintaining Coherence
 Need to guarantee that all processors see a consistent
value (i.e., consistent updates) for the same memory
location

 Writes to location A by P0 should be seen by P1


(eventually), and all writes to A should appear in some
order

 Coherence needs to provide:


 Write propagation: guarantee that updates will propagate
 Write serialization: provide a consistent order seen by all
processors for the same memory location

 Need a global point of serialization for this write ordering


167
Coherence: Update vs. Invalidate
 How can we safely update replicated data?
 Option 1 (Update protocol): push an update to all copies

 Option 2 (Invalidate protocol): ensure there is only one

copy (local), update it

 On a Read:
 If local copy is Invalid, put out request

 (If another node has a copy, it returns it, otherwise


memory does)

168
Coherence: Update vs. Invalidate (II)
 On a Write:
 Read block into cache as before

Update Protocol:
 Write to block, and simultaneously broadcast written
data and address to sharers
 (Other nodes update the data in their caches if block is
present)
Invalidate Protocol:
 Write to block, and simultaneously broadcast invalidation
of address to sharers
 (Other nodes invalidate block in their caches if block is
present)

169
Update vs. Invalidate Tradeoffs
 Which one is better? Update or invalidate?
 Write frequency and sharing behavior are critical
 Update
+ If sharer set is constant and updates are infrequent, avoids
the cost of invalidate-reacquire (broadcast update pattern)
- If data is rewritten without intervening reads by other cores,
updates would be useless
- Write-through cache policy  bus can become a bottleneck
 Invalidate
+ After invalidation, core has exclusive access rights
+ Only cores that keep reading after each write retain a copy
- If write contention is high, leads to ping-ponging (rapid
invalidation-reacquire traffic from different processors)

170
Additional Slides:
Memory Interference

171
Inter-Thread/Application Interference
 Problem: Threads share the memory system, but memory
system does not distinguish between threads’ requests

 Existing memory systems


 Free-for-all, shared based on demand
 Control algorithms thread-unaware and thread-unfair
 Aggressive threads can deny service to others
 Do not try to reduce or control inter-thread interference

172
Unfair Slowdowns due to Interference

[Figure: Two applications (e.g., matlab and gcc) running on different cores of a multi-core chip share the memory system and experience very different slowdowns]

Moscibroda and Mutlu, “Memory Performance Attacks: Denial of Memory Service in Multi-Core Systems,” USENIX Security 2007. 173
Uncontrolled Interference: An Example

[Figure: Multi-core chip with stream running on one core and random on another, each core with a private L2 cache, connected through an interconnect and a DRAM memory controller to a shared DRAM memory system with Banks 0-3; unfairness arises in the shared memory system]

174
A Memory Performance Hog
STREAM (streaming access):
// initialize large arrays A, B
for (j=0; j<N; j++) {
  index = j*linesize;
  A[index] = B[index];
  …
}
- Sequential memory access
- Very high row buffer locality (96% hit rate)
- Memory intensive

RANDOM (random access):
// initialize large arrays A, B
for (j=0; j<N; j++) {
  index = rand();
  A[index] = B[index];
  …
}
- Random memory access
- Very low row buffer locality (3% hit rate)
- Similarly memory intensive

Moscibroda and Mutlu, “Memory Performance Attacks,” USENIX Security 2007.

175
What Does the Memory Hog Do?

[Figure: DRAM bank with row decoder, row buffer, column mux, and memory request buffer. T0 (STREAM) keeps hitting in the currently open Row 0, while T1 (RANDOM) needs different rows. Row size: 8KB, cache block size: 64B, so up to 128 (8KB/64B) row-hit requests of T0 are serviced before T1’s request.]
Moscibroda and Mutlu, “Memory Performance Attacks,” USENIX Security 2007.

176
DRAM Controllers
 A row-conflict memory access takes significantly longer
than a row-hit access

 Current controllers take advantage of the row buffer

 Commonly used scheduling policy (FR-FCFS) [Rixner 2000]*


(1) Row-hit first: Service row-hit memory accesses first
(2) Oldest-first: Then service older accesses first

 This scheduling policy aims to maximize DRAM throughput


 But, it is unfair when multiple threads share the DRAM system

*Rixner et al., “Memory Access Scheduling,” ISCA 2000.


*Zuravleff and Robinson, “Controller for a synchronous DRAM …,” US Patent 5,630,096, May 1997.

177
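 A minimal C sketch (our own) of the FR-FCFS selection rule described above; the request-buffer structure and all names are illustrative assumptions, not taken from the cited papers.

#include <stdio.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    unsigned      row;       /* DRAM row this request targets     */
    unsigned long arrival;   /* arrival time: smaller means older */
    bool          valid;
} mem_req_t;

/* FR-FCFS for one bank: (1) prefer row-hit requests (row == open_row),
   (2) among equals, pick the oldest request. */
static int frfcfs_pick(const mem_req_t *q, size_t n, unsigned open_row) {
    int best = -1;
    bool best_hit = false;
    for (size_t i = 0; i < n; i++) {
        if (!q[i].valid) continue;
        bool hit = (q[i].row == open_row);
        if (best < 0 ||
            (hit && !best_hit) ||                                   /* row-hit first */
            (hit == best_hit && q[i].arrival < q[best].arrival)) {  /* then oldest   */
            best = (int)i;
            best_hit = hit;
        }
    }
    return best;   /* -1 if the request buffer is empty */
}

int main(void) {
    /* Open row is 7: the younger row-hit request (index 1) is served
       before the older row-conflict request (index 0). */
    mem_req_t q[] = { { 3, 10, true }, { 7, 20, true }, { 7, 30, true } };
    printf("picked request %d\n", frfcfs_pick(q, 3, 7));
    return 0;
}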
Effect of the Memory Performance Hog
[Bar chart: Slowdown when STREAM and RANDOM run together on a dual-core system; STREAM is slowed down 1.18X while RANDOM is slowed down 2.82X]

Results on Intel Pentium D running Windows XP


(Similar results for Intel Core Duo and AMD Turion, and on Fedora Linux)

Moscibroda and Mutlu, “Memory Performance Attacks,” USENIX Security 2007.

178
Greater Problem with More Cores

 Vulnerable to denial of service (DoS)


 Unable to enforce priorities or SLAs
 Low system performance
Uncontrollable, unpredictable system

179
Greater Problem with More Cores

 Vulnerable to denial of service (DoS)


 Unable to enforce priorities or SLAs
 Low system performance
Uncontrollable, unpredictable system

180
Distributed DoS in Networked Multi-Core Systems
Attackers Stock option pricing application
(Cores 1-8) (Cores 9-64)

Cores connected via


packet-switched
routers on chip

~5000X latency increase

Grot, Hestness, Keckler, Mutlu,


“Preemptive virtual clock: A Flexible,
Efficient, and Cost-effective QOS
Scheme for Networks-on-Chip,“
MICRO 2009.

181
More on Memory Performance Attacks
 Thomas Moscibroda and Onur Mutlu,
"Memory Performance Attacks: Denial of Memory Service
in Multi-Core Systems"
Proceedings of the 16th USENIX Security Symposium (USENIX
SECURITY), pages 257-274, Boston, MA, August 2007. Slides
(ppt)

182
http://www.youtube.com/watch?v=VJzZbwgBfy8
More on Interconnect Based Starvation
 Boris Grot, Stephen W. Keckler, and Onur Mutlu,
"Preemptive Virtual Clock: A Flexible, Efficient, and Cost-
effective QOS Scheme for Networks-on-Chip"
Proceedings of the 42nd International Symposium on
Microarchitecture (MICRO), pages 268-279, New York, NY,
December 2009. Slides (pdf)

183
Energy Comparison
of Memory Technologies
The Problem: Energy
 Faster is more energy-efficient
 SRAM, ~5 pJ
 DRAM, ~40-140 pJ
 PCM-DIMM (Intel Optane DC DIMM), ~80-540 pJ
 PCM-SSD, ~120 µJ
 Flash memory, ~250 µJ
 Hard Disk, ~60 mJ

 Other technologies have their place as well


 MRAM, RRAM, STT-MRAM, memristors, … (not mature yet)

185
The Problem (Table View): Energy
Bigger is slower; faster is more energy-efficient; faster is more expensive ($$$ and chip area)

Memory Device                     Capacity            Latency        Cost per Megabyte   Energy per access   Energy per byte access
SRAM                              < 1 KByte           sub-nanosec                        ~5 pJ               ~1.25 pJ
SRAM                              KByte~MByte         ~nanosec       < 0.3$
DRAM                              Gigabyte            ~50 nanosec    < 0.006$            ~40-140 pJ          ~10-35 pJ
PCM-DIMM (Intel Optane DC DIMM)   Gigabyte            ~300 nanosec   < 0.004$            ~80-540 pJ          ~20-135 pJ
PCM-SSD (Intel Optane SSD)        Gigabyte~Terabyte   ~6-10 µs       < 0.002$            ~120 µJ             ~30 nJ
Flash memory                      Gigabyte~Terabyte   ~50-100 µs     < 0.00008$          ~250 µJ             ~61 nJ
Hard Disk                         Terabyte            ~10 millisec   < 0.00003$          ~60 mJ              ~15 µJ

These sample values (circa ~2022) scale with time
186
Basic Cache Examples:
For You to Study
Cache Terminology
 Capacity (C):
 the number of data bytes a cache stores
 Block size (b):
 bytes of data brought into cache at once
 Number of blocks (B = C/b):
 number of blocks in cache: B = C/b

 Degree of associativity (N):


 number of blocks in a set
 Number of sets (S = B/N):
 each memory address maps to exactly one cache set

188
How is data found?
 Cache organized into S sets

 Each memory address maps to exactly one set

 Caches categorized by number of blocks in a set:


 Direct mapped: 1 block per set
 N-way set associative: N blocks per set
 Fully associative: all cache blocks are in a single set

 Examine each organization for a cache with:


 Capacity (C = 8 words)

 Block size (b = 1 word)

 So, number of blocks (B = 8)

189
Direct Mapped Cache
Address
11...11111100 mem[0xFF...FC]
11...11111000 mem[0xFF...F8]
11...11110100 mem[0xFF...F4]
11...11110000 mem[0xFF...F0]
11...11101100 mem[0xFF...EC]
11...11101000 mem[0xFF...E8]
11...11100100 mem[0xFF...E4]
11...11100000 mem[0xFF...E0]

00...00100100 mem[0x00...24]
00...00100000 mem[0x00..20] Set Number
00...00011100 mem[0x00..1C] 7 (111)
00...00011000 mem[0x00...18] 6 (110)
00...00010100 mem[0x00...14] 5 (101)
00...00010000 mem[0x00...10] 4 (100)
00...00001100 mem[0x00...0C] 3 (011)
00...00001000 mem[0x00...08] 2 (010)
00...00000100 mem[0x00...04] 1 (001)
00...00000000 mem[0x00...00] 0 (000)

230 Word Main Memory 23 Word Cache


190
Direct Mapped Cache Hardware
Byte
Tag Set Offset
Memory
00
Address
27 3
V Tag Data

8-entry x
(1+27+32)-bit
SRAM

27 32

Hit Data

191
Direct Mapped Cache Performance
Byte
Tag Set Offset
Memory
00...00 001 00
Address 3
V Tag Data
0 Set 7 (111)
0 Set 6 (110)
0 Set 5 (101)
0 Set 4 (100)
1 00...00 mem[0x00...0C] Set 3 (011)
1 00...00 mem[0x00...08] Set 2 (010)
1 00...00 mem[0x00...04] Set 1 (001)
0 Set 0 (000)

# MIPS assembly code
      addi $t0, $0, 5
loop: beq  $t0, $0, done
      lw   $t1, 0x4($0)
      lw   $t2, 0xC($0)
      lw   $t3, 0x8($0)
      addi $t0, $t0, -1
      j    loop
done:

Miss Rate =

192
Direct Mapped Cache Performance
Byte
Tag Set Offset
Memory
00...00 001 00
Address 3
V Tag Data
0 Set 7 (111)
0 Set 6 (110)
0 Set 5 (101)
0 Set 4 (100)
1 00...00 mem[0x00...0C] Set 3 (011)
1 00...00 mem[0x00...08] Set 2 (010)
1 00...00 mem[0x00...04] Set 1 (001)
0 Set 0 (000)

# MIPS assembly code
      addi $t0, $0, 5
loop: beq  $t0, $0, done
      lw   $t1, 0x4($0)
      lw   $t2, 0xC($0)
      lw   $t3, 0x8($0)
      addi $t0, $t0, -1
      j    loop
done:

Miss Rate = 3/15 = 20%
Compulsory Misses (first accesses); later accesses hit due to Temporal Locality

193
Direct Mapped Cache: Conflict
Byte
Tag Set Offset
Memory
00...01 001 00
Address 3
V Tag Data
0 Set 7 (111)
0 Set 6 (110)
0 Set 5 (101)
0 Set 4 (100)
0 Set 3 (011)
0 Set 2 (010)
mem[0x00...04] Set 1 (001)
1 00...00 mem[0x00...24]
0 Set 0 (000)

# MIPS assembly code
      addi $t0, $0, 5
loop: beq  $t0, $0, done
      lw   $t1, 0x4($0)
      lw   $t2, 0x24($0)
      addi $t0, $t0, -1
      j    loop
done:

Miss Rate =

194
Direct Mapped Cache: Conflict
Byte
Tag Set Offset
Memory
00...01 001 00
Address 3
V Tag Data
0 Set 7 (111)
0 Set 6 (110)
0 Set 5 (101)
0 Set 4 (100)
0 Set 3 (011)
0 Set 2 (010)
mem[0x00...04] Set 1 (001)
1 00...00 mem[0x00...24]
0 Set 0 (000)

# MIPS assembly code
      addi $t0, $0, 5
loop: beq  $t0, $0, done
      lw   $t1, 0x4($0)
      lw   $t2, 0x24($0)
      addi $t0, $t0, -1
      j    loop
done:

Miss Rate = 10/10 = 100%
Conflict Misses

195
N-Way Set Associative Cache
Byte
Tag Set Offset
Memory
00
Address Way 1 Way 0
28 2
V Tag Data V Tag Data

28 32 28 32

= =

0
Hit1 Hit0 Hit1
32

Hit Data

196
N-way Set Associative Performance
# MIPS assembly code
      addi $t0, $0, 5
loop: beq  $t0, $0, done
      lw   $t1, 0x4($0)
      lw   $t2, 0x24($0)
      addi $t0, $t0, -1
      j    loop
done:

Miss Rate =

Way 1 Way 0
V Tag Data V Tag Data
0 0 Set 3
0 0 Set 2
1 00...10 mem[0x00...24] 1 00...00 mem[0x00...04] Set 1
0 0 Set 0

197
N-way Set Associative Performance
# MIPS assembly code
      addi $t0, $0, 5
loop: beq  $t0, $0, done
      lw   $t1, 0x4($0)
      lw   $t2, 0x24($0)
      addi $t0, $t0, -1
      j    loop
done:

Miss Rate = 2/10 = 20%
Associativity reduces conflict misses

Way 1 Way 0
V Tag Data V Tag Data
0 0 Set 3
0 0 Set 2
1 00...10 mem[0x00...24] 1 00...00 mem[0x00...04] Set 1
0 0 Set 0

198
Fully Associative Cache
 No conflict misses

 Expensive to build

V Tag Data V Tag Data V Tag Data V Tag Data V Tag Data V Tag Data V Tag Data V Tag Data

199
Spatial Locality?
 Increase block size:
 Block size, b = 4 words
 C = 8 words
 Direct mapped (1 block per set)
 Number of blocks, B = C/b = 8/4 = 2
Block Byte
Tag Set Offset Offset
Memory
00
Address
27 2
V Tag Data
Set 1
Set 0
27 32 32 32 32
11

10

01

00
32
=

Hit Data
200
Direct Mapped Cache Performance
      addi $t0, $0, 5
loop: beq  $t0, $0, done
      lw   $t1, 0x4($0)
      lw   $t2, 0xC($0)
      lw   $t3, 0x8($0)
      addi $t0, $t0, -1
      j    loop
done:

Miss Rate =

Block Byte
Tag Set Offset Offset
Memory
00...00 0 11 00
Address
27 2
V Tag Data
0 Set 1
1 00...00 mem[0x00...0C] mem[0x00...08] mem[0x00...04] mem[0x00...00] Set 0
27 32 32 32 32
11

10

01

00
32
=

Hit Data

201
Direct Mapped Cache Performance
      addi $t0, $0, 5
loop: beq  $t0, $0, done
      lw   $t1, 0x4($0)
      lw   $t2, 0xC($0)
      lw   $t3, 0x8($0)
      addi $t0, $t0, -1
      j    loop
done:

Miss Rate = 1/15 = 6.67%
Larger blocks reduce compulsory misses through spatial locality
Block Byte
Tag Set Offset Offset
Memory
00...00 0 11 00
Address
27 2
V Tag Data
0 Set 1
1 00...00 mem[0x00...0C] mem[0x00...08] mem[0x00...04] mem[0x00...00] Set 0
27 32 32 32 32
11

10

01

00
32
=

Hit Data

202
Cache Organization Recap
 Main Parameters
 Capacity: C
 Block size: b
 Number of blocks in cache: B = C/b
 Number of blocks in a set: N
 Number of Sets: S = B/N

Organization              Number of Ways (N)    Number of Sets (S = B/N)
Direct Mapped             1                     B
N-Way Set Associative     1 < N < B             B/N
Fully Associative         B                     1

203
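 How these parameters map onto an address: a small, self-contained C example (our own; the cache parameters chosen here are illustrative) that splits an address into tag, set index, and byte offset.

#include <stdint.h>
#include <stdio.h>

/* Illustrative parameters: C = 32 KB, b = 64 B, N = 4 ways
   => B = C/b = 512 blocks, S = B/N = 128 sets. */
#define CAPACITY   (32 * 1024)
#define BLOCK_SIZE 64
#define NUM_WAYS   4
#define NUM_BLOCKS (CAPACITY / BLOCK_SIZE)
#define NUM_SETS   (NUM_BLOCKS / NUM_WAYS)

/* log2 for powers of two. */
static unsigned log2u(unsigned x) { unsigned n = 0; while (x >>= 1) n++; return n; }

int main(void) {
    uint32_t addr        = 0x12345678;
    unsigned offset_bits = log2u(BLOCK_SIZE);   /* 6 bits */
    unsigned index_bits  = log2u(NUM_SETS);     /* 7 bits */

    uint32_t byte_offset = addr & (BLOCK_SIZE - 1);
    uint32_t set_index   = (addr >> offset_bits) & (NUM_SETS - 1);
    uint32_t tag         = addr >> (offset_bits + index_bits);

    printf("offset=%u set=%u tag=0x%x\n",
           (unsigned)byte_offset, (unsigned)set_index, (unsigned)tag);
    return 0;
}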
Capacity Misses
 Cache is too small to hold all data of interest at one time
 If the cache is full and program tries to access data X that is
not in cache, cache must evict data Y to make room for X
 Capacity miss occurs if program then tries to access Y again
 X will be placed in a particular set based on its address

 In a direct mapped cache, there is only one place to put X

 In an associative cache, there are multiple ways where X


could go in the set.

 How to choose Y to minimize chance of needing it again?


 Least recently used (LRU) replacement: the least recently
used block in a set is evicted when the cache is full.

204
Types of Misses
 Compulsory: first time data is accessed

 Capacity: cache too small to hold all data of interest

 Conflict: data of interest maps to same location in cache

 Miss penalty: time it takes to retrieve a block from lower


level of hierarchy

205
LRU Replacement
# MIPS assembly

lw $t0, 0x04($0)
lw $t1, 0x24($0)
lw $t2, 0x54($0)

V U Tag Data V Tag Data Set Number


3 (11)
2 (10)
(a)
1 (01)
0 (00)

V U Tag Data V Tag Data Set Number


3 (11)
2 (10)
(b)
1 (01)
0 (00)

206
LRU Replacement
# MIPS assembly

lw $t0, 0x04($0)
lw $t1, 0x24($0)
lw $t2, 0x54($0)

Way 1 Way 0

V U Tag Data V Tag Data


0 0 0 Set 3 (11)
0 0 0 Set 2 (10)
1 0 00...010 mem[0x00...24] 1 00...000 mem[0x00...04] Set 1 (01)
0 0 0 Set 0 (00)
(a)
Way 1 Way 0

V U Tag Data V Tag Data


0 0 0 Set 3 (11)
0 0 0 Set 2 (10)
1 1 00...010 mem[0x00...24] 1 00...101 mem[0x00...54] Set 1 (01)
0 0 0 Set 0 (00)
(b) 207
Slides for Future Lectures

208
Issues in Set-Associative Caches
 Think of each block in a set having a “priority”
 Indicating how important it is to keep the block in the cache
 Key issue: How do you determine/adjust block priorities?
 There are three key decisions in a set:
 Insertion, promotion, eviction (replacement)

 Insertion: What happens to priorities on a cache fill?


 Where to insert the incoming block, whether or not to insert the block
 Promotion: What happens to priorities on a cache hit?
 Whether and how to change block priority
 Eviction/replacement: What happens to priorities on a cache
miss?
 Which block to evict and how to adjust priorities
209
Eviction/Replacement Policy
 Which block in the set to replace on a cache miss?
 Any invalid block first
 If all are valid, consult the replacement policy
 Random
 FIFO
 Least recently used (how to implement?)
 Not most recently used
 Least frequently used?
 Least costly to re-fetch?
 Why would memory accesses have different cost?
 Hybrid replacement policies
 Optimal replacement policy?

210
Implementing LRU
 Idea: Evict the least recently accessed block
 Problem: Need to keep track of access ordering of blocks

 Question: 2-way set associative cache:


 What do you minimally need to implement LRU perfectly?

 Question: 4-way set associative cache:


 What do you minimally need to implement LRU perfectly?
 How many different orderings possible for the 4 blocks in the
set?
 How many bits needed to encode the LRU order of a block?
 What is the logic needed to determine the LRU victim?

 Repeat for N-way set associative cache


211
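 One possible answer, as a sketch: true LRU for an N-way set can be tracked with one log2(N)-bit age per way (a single bit per set suffices for 2 ways). The C code below is our own illustration of the update and victim-selection logic.

#include <stdio.h>
#include <stdint.h>

#define NUM_WAYS 4

/* 0 = most recently used, NUM_WAYS-1 = least recently used. */
typedef struct { uint8_t age[NUM_WAYS]; } lru_state_t;

static void lru_init(lru_state_t *s) {            /* ages start as 0..N-1 */
    for (int w = 0; w < NUM_WAYS; w++) s->age[w] = (uint8_t)w;
}

/* On an access (hit or fill) to 'way': every way that was younger
   ages by one, and the accessed way becomes the MRU. */
static void lru_touch(lru_state_t *s, int way) {
    uint8_t old = s->age[way];
    for (int w = 0; w < NUM_WAYS; w++)
        if (s->age[w] < old) s->age[w]++;
    s->age[way] = 0;
}

/* Victim = the way whose age is NUM_WAYS-1. */
static int lru_victim(const lru_state_t *s) {
    for (int w = 0; w < NUM_WAYS; w++)
        if (s->age[w] == NUM_WAYS - 1) return w;
    return 0;   /* not reached: ages always form a permutation of 0..N-1 */
}

int main(void) {
    lru_state_t s;
    lru_init(&s);
    lru_touch(&s, 2);   /* access way 2 */
    lru_touch(&s, 0);   /* access way 0 */
    printf("LRU victim: way %d\n", lru_victim(&s));   /* prints way 3 */
    return 0;
}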
Approximations of LRU
 Most modern processors do not implement “true LRU” (also
called “perfect LRU”) in highly-associative caches

 Why?
 True LRU is complex
 LRU is an approximation to predict locality anyway (i.e., not
the best possible cache management policy)

 Examples:
 Not MRU (not most recently used)
 Hierarchical LRU: divide the N-way set into M “groups”, track
the MRU group and the MRU way in each group
 Victim-NextVictim Replacement: Only keep track of the victim
and the next victim
212
Cache Replacement Policy: LRU or Random
 LRU vs. Random: Which one is better?
 Example: 4-way cache, cyclic references to A, B, C, D, E
 0% hit rate with LRU policy
 Set thrashing: When the “program working set” in a set is
larger than set associativity
 Random replacement policy is better when thrashing occurs
 In practice:
 Performance of replacement policy depends on workload
 Average hit rate of LRU and Random are similar

 Best of both Worlds: Hybrid of LRU and Random


 How to choose between the two? Set sampling
 See Qureshi et al., ”A Case for MLP-Aware Cache Replacement,”
ISCA 2006.
213
What Is the Optimal Replacement Policy?
 Belady’s OPT
 Replace the block that is going to be referenced furthest in the
future by the program
 Belady, “A study of replacement algorithms for a virtual-storage
computer,” IBM Systems Journal, 1966.
 How do we implement this? Simulate?

 Is this optimal for minimizing miss rate?


 Is this optimal for minimizing execution time?
 No. Cache miss latency/cost varies from block to block!
 Two reasons: Where miss is serviced from and miss overlapping
 Qureshi et al. “A Case for MLP-Aware Cache Replacement,"
ISCA 2006.

214
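 Belady's OPT needs the future access trace, so it can only be evaluated in simulation. A minimal C sketch (our own, not from the cited papers) of OPT victim selection over a known future trace:

#include <stdio.h>
#include <stddef.h>

/* Among the blocks currently cached in the set, evict the one whose
   next reference in the remaining (future) trace is furthest away. */
static int opt_victim(const unsigned *cached, size_t num_ways,
                      const unsigned *future, size_t future_len) {
    int victim = 0;
    size_t furthest = 0;
    for (size_t w = 0; w < num_ways; w++) {
        size_t next = future_len;                /* "never referenced again" */
        for (size_t t = 0; t < future_len; t++)
            if (future[t] == cached[w]) { next = t; break; }
        if (next == future_len) return (int)w;   /* dead block: evict it now */
        if (next > furthest) { furthest = next; victim = (int)w; }
    }
    return victim;
}

int main(void) {
    unsigned cached[] = { 1, 2, 3, 4 };
    unsigned future[] = { 2, 1, 4, 2, 1 };       /* block 3 is never reused */
    printf("OPT victim: block %u\n", cached[opt_victim(cached, 4, future, 5)]);
    return 0;
}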
Recommended Reading
 Key observation: Some misses more costly than others as their latency is
exposed as stall time. Reducing miss rate is not always good for
performance. Cache replacement should take into account cost of misses.

 Moinuddin K. Qureshi, Daniel N. Lynch, Onur Mutlu, and Yale N. Patt,


"A Case for MLP-Aware Cache Replacement"
Proceedings of the 33rd International Symposium on Computer
Architecture (ISCA), pages 167-177, Boston, MA, June 2006. Slides (ppt)

215
What’s In A Tag Store Entry?
 Valid bit
 Tag
 Replacement policy bits

 Dirty bit?
 Write back vs. write through caches

216
Handling Writes (I)
 When do we write the modified data in a cache to the next level?
 Write through: At the time the write happens
 Write back: When the block is evicted

 Write-back
+ Can combine multiple writes to the same block before eviction
 Potentially saves bandwidth between cache levels + saves energy
-- Need a bit in the tag store indicating the block is “dirty/modified”

 Write-through
+ Simpler design
+ All levels are up to date & consistent  Simpler cache coherence: no
need to check close-to-processor caches’ tag stores for presence
-- More bandwidth intensive; no combining of writes

217
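 To make the dirty-bit discussion concrete: a small C sketch (our own, illustrative) of a tag store entry and the write-hit and eviction paths under the two policies. The data-store and next-level helpers are stand-ins, not a real cache API.

#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

/* A tag store entry with the fields listed on the previous slide. */
typedef struct {
    bool     valid;
    bool     dirty;     /* needed only by a write-back cache */
    uint8_t  repl_bits; /* replacement policy state          */
    uint32_t tag;
} tag_entry_t;

/* Stand-ins for the data store and the next level of the hierarchy. */
static void cache_data_write(uint32_t addr, uint32_t data) { (void)addr; (void)data; }
static void next_level_write(uint32_t addr, uint32_t data) {
    (void)data; printf("write-through to 0x%x\n", (unsigned)addr);
}
static void next_level_writeback(uint32_t block_addr) {
    printf("writeback of block 0x%x\n", (unsigned)block_addr);
}

/* Write hit, write-through: update cache and next level immediately. */
static void write_hit_through(tag_entry_t *e, uint32_t addr, uint32_t data) {
    (void)e;                          /* no dirty bit needed               */
    cache_data_write(addr, data);
    next_level_write(addr, data);     /* cache and next level stay in sync */
}

/* Write hit, write-back: update only the cache and mark it dirty. */
static void write_hit_back(tag_entry_t *e, uint32_t addr, uint32_t data) {
    cache_data_write(addr, data);
    e->dirty = true;                  /* write back later, on eviction     */
}

/* Eviction: only a dirty block in a write-back cache moves data down;
   multiple writes to the block were combined into this one writeback. */
static void evict(tag_entry_t *e, uint32_t block_addr) {
    if (e->valid && e->dirty) next_level_writeback(block_addr);
    e->valid = false;
    e->dirty = false;
}

int main(void) {
    tag_entry_t e = { true, false, 0, 0x1A };
    write_hit_through(&e, 0x1A40, 7); /* write-through goes down right away */
    write_hit_back(&e, 0x1A40, 8);
    write_hit_back(&e, 0x1A44, 9);    /* combined: still only one writeback */
    evict(&e, 0x1A40);
    return 0;
}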
Handling Writes (II)
 Do we allocate a cache block on a write miss?
 Allocate on write miss: Yes
 No-allocate on write miss: No

 Allocate on write miss


+ Can combine writes instead of writing each individually to next
level
+ Simpler because write misses can be treated the same way as
read misses
-- Requires transfer of the whole cache block

 No-allocate
+ Conserves cache space if locality of written blocks is low
(potentially better cache hit rate)
218
Handling Writes (III)
 What if the processor writes to an entire block over a small
amount of time?

 Is there any need to bring the block into the cache from
memory in the first place?

 Why do we not simply write to only a portion of the block,


i.e., subblock
 E.g., 4 bytes out of 64 bytes
 Problem: Valid and dirty bits are associated with the entire 64
bytes, not with each individual 4 bytes

219
Subblocked (Sectored) Caches
 Idea: Divide a block into subblocks (or sectors)
 Have separate valid and dirty bits for each subblock (sector)
 Allocate only a subblock (or a subset of subblocks) on a request

++ No need to transfer the entire cache block into the cache


(A write simply validates and updates a subblock)
++ More freedom in transferring subblocks into the cache (a
cache block does not need to be in the cache fully)
(How many subblocks do you transfer on a read?)

-- More complex design


-- May not exploit spatial locality fully
v d subblock v d subblock v d subblock tag
220
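 A minimal C sketch (our own) of a subblocked tag store entry with per-subblock valid/dirty bits; the subblock count is an illustrative assumption.

#include <stdio.h>
#include <stdint.h>

#define SUBBLOCKS 4   /* e.g., four 16 B subblocks in a 64 B block */

/* One tag for the whole block, separate valid/dirty bits per subblock. */
typedef struct {
    uint32_t tag;
    uint8_t  valid;   /* bit i set: subblock i holds useful data         */
    uint8_t  dirty;   /* bit i set: subblock i was modified (write-back) */
} sectored_entry_t;

/* A write to subblock 'sb' of a resident block simply validates and
   updates that subblock; the rest of the block need not be fetched. */
static void subblock_write(sectored_entry_t *e, int sb) {
    e->valid |= (uint8_t)(1u << sb);
    e->dirty |= (uint8_t)(1u << sb);
}

int main(void) {
    sectored_entry_t e = { 0x3F, 0, 0 };
    subblock_write(&e, 2);   /* write touches only subblock 2 */
    printf("valid=0x%x dirty=0x%x\n", e.valid, e.dirty);   /* 0x4 0x4 */
    return 0;
}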
Instruction vs. Data Caches
 Separate or Unified?

 Pros and Cons of Unified:


+ Dynamic sharing of cache space: no overprovisioning that
might happen with static partitioning (i.e., separate I and D
caches)
-- Instructions and data can evict/thrash each other (i.e., no
guaranteed space for either)
-- I and D are accessed in different places in the pipeline. Where
do we place the unified cache for fast access?

 First level caches are almost always split


 Mainly for the last reason above – pipeline constraints
 Outer level caches are almost always unified
221
Multi-level Caching in a Pipelined Design
 First-level caches (instruction and data)
 Decisions very much affected by cycle time & pipeline structure
 Small, lower associativity; latency is critical
 Tag store and data store usually accessed in parallel
 Second- and third-level caches
 Decisions need to balance hit rate and access latency
 Usually large and highly associative; latency not as important
 Tag store and data store can be accessed serially

 Serial vs. Parallel access of levels


 Serial: Second level cache accessed only if first-level misses
 Second level does not see the same accesses as the first
 First level acts as a filter (filters some temporal and spatial locality)
 Management policies are therefore different
222
Deeper and Larger Cache Hierarchies

Apple M1,
2021

Source: https://www.anandtech.com/show/16252/mac-mini-apple-m1-tested 223


Deeper and Larger Cache Hierarchies

Intel Alder Lake,


2021
Source: https://twitter.com/Locuza_/status/1454152714930331652 224
Deeper and Larger Cache Hierarchies

Core Count:
8 cores/16 threads

L1 Caches:
32 KB per core

L2 Caches:
512 KB per core

L3 Cache:
32 MB shared

AMD Ryzen 5000, 2020


https://wccftech.com/amd-ryzen-5000-zen-3-vermeer-undressed-high-res-die-shots-close-ups-pictured-detailed/ 225
AMD’s 3D Last Level Cache (2021)
AMD increases the L3 size of their 8-core Zen 3
processors from 32 MB to 96 MB

Additional 64 MB L3 cache die


stacked on top of the processor die
- Connected using Through Silicon Vias (TSVs)
https://community.microcenter.com/discussion/5
134/comparing-zen-3-to-zen-2
- Total of 96 MB L3 cache

https://youtu.be/gqAYMx34euU 226
https://www.tech-critter.com/amd-keynote-computex-2021/
Deeper and Larger Cache Hierarchies
IBM POWER10,
2020

Cores:
15-16 cores,
8 threads/core

L2 Caches:
2 MB per core

L3 Cache:
120 MB shared

https://www.it-techblog.de/ibm-power10-prozessor-mehr-speicher-mehr-tempo-mehr-sicherheit/09/2020/ 227
Deeper and Larger Cache Hierarchies

Cores:
128 Streaming Multiprocessors

L1 Cache or
Scratchpad:
192KB per SM
Can be used as L1 Cache
and/or Scratchpad

L2 Cache:
40 MB shared

Nvidia Ampere, 2020


https://www.tomshardware.com/news/infrared-photographer-photos-nvidia-ga102-ampere-silicon 228
Deeper and Larger Cache Hierarchies
Nvidia Hopper, 2022

Cores: L1 Cache or L2 Cache:


144 Streaming Scratchpad: 60 MB shared
Multiprocessors 256KB per SM
Can be used as L1 Cache
and/or Scratchpad
https://wccftech.com/nvidia-hopper-gpus-featuring-mcm-technology-tape-out-soon-rumor/ 229
Deeper and Larger Cache Hierarchies
Nvidia Hopper,
2022

Cores: L1 Cache or L2 Cache:


144 Streaming Scratchpad: 60 MB shared
Multiprocessors 256KB per SM
Can be used as L1 Cache
and/or Scratchpad

https://developer.nvidia.com/blog/nvidia-hopper-architecture-in-depth/ 230
NVIDIA V100 & A100 Memory Hierarchy

 Example of data movement between GPU global memory (DRAM) and GPU cores.

A100 feature: Direct copy from L2 to scratchpad, bypassing L1 and register file.

[Figure 15, “A100 SM Data Movement Efficiency,” from the NVIDIA A100 Tensor Core GPU Architecture whitepaper: A100 improves SM bandwidth efficiency with a new load-global-store-shared asynchronous copy instruction that bypasses L1 cache and register file (RF); A100’s more efficient Tensor Cores also reduce shared memory (SMEM) loads.]

https://images.nvidia.com/aem-dam/en-zz/Solutions/data-center/nvidia-ampere-architecture-whitepaper.pdf 231
Memory in the NVIDIA H100 GPU
[Figure: H100 memory hierarchy. Each SM contains cores, registers (≈1 cycle), a constant cache (≈5 cycles), and L1 cache / shared memory (≈5 cycles); SM-to-SM direct copy is supported. All SMs share a 60 MB L2 cache. Global memory: ≈500 cycles, 3 TB/s, 80 GB.]

Slide credit: Izzat El Hajj 232


Multi-Level Cache Design Decisions
 Which level(s) to place a block into (from memory)?

 Which level(s) to evict a block to (from an inner level)?

 Bypassing vs. non-bypassing levels

 Inclusive, exclusive, non-inclusive hierarchies


 Inclusive: a block in an inner level is always included also in
an outer level  simplifies cache coherence
 Exclusive: a block in an inner level does not exist in an outer
level  better utilizes space in the entire hierarchy
 Non-inclusive: a block in an inner level may or may not be
included in an outer level  relaxes design decisions
233
Cache Performance
Cache Parameters vs. Miss/Hit Rate
 Cache size

 Block size

 Associativity

 Replacement policy
 Insertion/Placement policy
 Promotion Policy

235
Cache Size
 Cache size: total data (not including tag) capacity
 bigger can exploit temporal locality better

 Too large a cache adversely affects hit and miss latency


 bigger is slower

 Too small a cache
 does not exploit temporal locality well
 useful data replaced often

 Working set: entire set of data the executing application references
 Within a time interval

[Plot: hit rate vs. cache size; hit rate increases with cache size and saturates once the cache exceeds the “working set” size]
236
Benefit of Larger Caches Widely Varies
 Benefits of cache size vary widely across applications

[Plot: Misses per 1000 instructions vs. number of ways allocated from a 16-way 1MB L2, for three application types: low utility, high utility, and saturating utility]

Qureshi and Patt, “Utility-Based Cache Partitioning,” MICRO 2006. 237


Block Size
 Block size is the data that is associated with an address tag
 not necessarily the unit of transfer between hierarchies
 Sub-blocking: A block divided into multiple pieces (each w/ V/D bits)

 Too small blocks
 do not exploit spatial locality well
 have larger tag overhead

 Too large blocks
 too few total blocks  do not exploit temporal locality well
 waste cache space and bandwidth/energy if spatial locality is not high

[Plot: hit rate vs. block size; hit rate peaks at an intermediate block size]
238
Large Blocks: Critical-Word and Subblocking
 Large cache blocks can take a long time to fill into the cache
 Idea: Fill cache block critical-word first
 Supply the critical data to the processor immediately

 Large cache blocks can waste bus bandwidth


 Idea: Divide a block into subblocks
 Associate separate valid and dirty bits for each subblock
 Recall: When is this useful?

v d subblock v d subblock v d subblock tag

239
Associativity
 How many blocks can be present in the same index (i.e., set)?

 Larger associativity
 lower miss rate (reduced conflicts)
 higher hit latency and area cost
 Smaller associativity
 lower cost
 lower hit latency
 Especially important for L1 caches

[Plot: hit rate vs. associativity; hit rate improves with higher associativity, with diminishing returns]
 Is power of 2 associativity required?
240
Recall: Higher Associativity (4-way)
 4-way
Tag store

=? =? =? =?

Logic Hit?

Data store

MUX
byte in block
MUX Address
tag index byte in block
4 bits 1b 3 bits

241
Higher Associativity (3-way)
 3-way
Tag store

=? =? =?

Logic Hit?

Data store

MUX
byte in block
MUX Address
tag index byte in block
4 bits 1b 3 bits

242
Recall: 8-way Fully Associative Cache

Tag store

=? =? =? =? =? =? =? =?

Logic

Hit?

Data store

MUX
byte in block
Address MUX
tag byte in block
5 bits 3 bits

243
7-way Fully Associative Cache

Tag store

=? =? =? =? =? =? =?

Logic

Hit?

Data store

MUX
byte in block
Address MUX
tag byte in block
5 bits 3 bits

244
Classification of Cache Misses
 Compulsory miss
 first reference to an address (block) always results in a miss
 subsequent references should hit unless the cache block is
displaced for the reasons below

 Capacity miss
 cache is too small to hold all needed data
 defined as the misses that would occur even in a fully-
associative cache (with optimal replacement) of the same
capacity

 Conflict miss
 defined as any miss that is neither a compulsory nor a
capacity miss
245
How to Reduce Each Miss Type
 Compulsory
 Caching (only accessed data) cannot help; larger blocks can
 Prefetching helps: Anticipate which blocks will be needed soon
 Conflict
 More associativity
 Other ways to get more associativity without making the
cache associative
 Victim cache
 Better, randomized indexing into the cache
 Software hints for eviction/replacement/promotion
 Capacity
 Utilize cache space better: keep blocks that will be referenced
 Software management: divide working set and computation
such that each “computation phase” fits in cache
246
How to Improve Cache Performance
 Three fundamental goals

 Reducing miss rate


 Caveat: reducing miss rate can reduce performance if more
costly-to-refetch blocks are evicted

 Reducing miss latency or miss cost

 Reducing hit latency or hit cost

 The above three together affect performance

247
Improving Basic Cache Performance
 Reducing miss rate
 More associativity
 Alternatives/enhancements to associativity
 Victim caches, hashing, pseudo-associativity, skewed associativity
 Better replacement/insertion policies
 Software approaches
 Reducing miss latency/cost
 Multi-level caches
 Critical word first
 Subblocking/sectoring
 Better replacement/insertion policies
 Non-blocking caches (multiple cache misses in parallel)
 Multiple accesses per cycle
 Software approaches

248
Software Approaches for Higher Hit Rate
 Restructuring data access patterns
 Restructuring data layout

 Loop interchange
 Data structure separation/merging
 Blocking
 …

249
Restructuring Data Access Patterns (I)
 Idea: Restructure data layout or data access patterns
 Example: If column-major
 x[i+1,j] follows x[i,j] in memory
 x[i,j+1] is far away from x[i,j]

Poor code:
for i = 1, rows
  for j = 1, columns
    sum = sum + x[i,j]

Better code:
for j = 1, columns
  for i = 1, rows
    sum = sum + x[i,j]

 This is called loop interchange


 Other optimizations can also increase hit rate
 Loop fusion, array merging, …
250
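 The loops above assume column-major layout; C arrays are row-major, so the roles of the two loop orders flip. A small compilable C version of the same idea (our own example):

#include <stdio.h>

#define ROWS 1024
#define COLS 1024
static double x[ROWS][COLS];   /* row-major: x[i][j+1] is adjacent to x[i][j] */

int main(void) {
    double sum = 0.0;

    /* Poor locality in C: the inner loop strides by a whole row. */
    for (int j = 0; j < COLS; j++)
        for (int i = 0; i < ROWS; i++)
            sum += x[i][j];

    /* Better locality in C: the inner loop walks consecutive addresses. */
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            sum += x[i][j];

    printf("%f\n", sum);
    return 0;
}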
Restructuring Data Access Patterns (II)

 Blocking
 Divide loops operating on arrays into computation chunks so
that each chunk can hold its data in the cache
 Avoids cache conflicts between different chunks of
computation
 Essentially: Divide the working set so that each piece fits in
the cache

 Also called Tiling

251
Data Reuse: An Example from GPU Computing
 Same memory locations accessed by neighboring threads

Gaussian filter applied on


every pixel of an image

for (int i = 0; i < 3; i++){
    for (int j = 0; j < 3; j++){
        sum += gauss[i][j] * Image[(i+row-1)*width + (j+col-1)];
    }
}

Lecture 22: GPU Programming (Spring 2018) https://www.youtube.com/watch?v=y40-tY5WJ8A 252


Data Reuse: Tiling in GPU Computing
 To take advantage of data reuse, we divide the input into tiles
that can be loaded into shared memory (scratchpad memory)

__shared__ int l_data[(L_SIZE+2)*(L_SIZE+2)];

// Load tile into shared memory
__syncthreads();

for (int i = 0; i < 3; i++){
    for (int j = 0; j < 3; j++){
        sum += gauss[i][j] * l_data[(i+l_row-1)*(L_SIZE+2)+j+l_col-1];
    }
}

Lecture 22: GPU Programming (Spring 2018) https://www.youtube.com/watch?v=y40-tY5WJ8A 253


Naïve Matrix Multiplication (I)
 Matrix multiplication: C = A x B
 Consider two input matrices A and B in row-major layout
 A size is M x P
 B size is P x N
 C size is M x N

[Figure: matrices A (M x P), B (P x N), C (M x N); element C(i,j) is computed from row i of A and column j of B, indexed by k]
254
Naïve Matrix Multiplication (II)
 Naïve implementation of matrix multiplication has poor
cache locality
#define A(i,j) matrix_A[i * P + j]
#define B(i,j) matrix_B[i * N + j]
#define C(i,j) matrix_C[i * N + j]

for (i = 0; i < M; i++){           // i = row index
    for (j = 0; j < N; j++){       // j = column index
        C(i, j) = 0;               // Set to zero
        for (k = 0; k < P; k++)    // Row x Col
            C(i, j) += A(i, k) * B(k, j);
    }
}

Consecutive accesses to B are far from each other, in different cache lines.
Every access to B is likely to cause a cache miss.
255
Tiled Matrix Multiplication (I)
 We can achieve better cache locality by computing on smaller tiles or blocks that fit in the cache
 Or in the scratchpad memory and register file if we compute on a GPU

[Figure: A, B, and C partitioned into tile_dim x tile_dim tiles; each tile of C is computed from a row of tiles of A and a column of tiles of B]
Lam+, "The cache performance and optimizations of blocked algorithms," ASPLOS 1991. https://doi.org/10.1145/106972.106981
Bansal+, "Chapter 15 - Fast Matrix Computations on Heterogeneous Streams," in "High Performance Parallelism Pearls", 2015. https://doi.org/10.1016/B978-0-12-803819-2.00011-2
256
Kirk & Hwu, "Chapter 5 - Performance considerations," in "Programming Massively Parallel Processors (Third Edition)", 2017. https://doi.org/10.1016/B978-0-12-811986-0.00005-4
Tiled Matrix Multiplication (II)
 Tiled implementation operates on submatrices (tiles or
blocks) that fit fast memories (cache, scratchpad, RF)
#define A(i,j) matrix_A[i * P + j]
#define B(i,j) matrix_B[i * N + j]
#define C(i,j) matrix_C[i * N + j]

for (I = 0; I < M; I += tile_dim){
    for (J = 0; J < N; J += tile_dim){
        Set_to_zero(&C(I, J));     // Set to zero
        for (K = 0; K < P; K += tile_dim)
            Multiply_tiles(&C(I, J), &A(I, K), &B(K, J));
    }
}

Multiply small submatrices (tiles or blocks) of size tile_dim x tile_dim
Lam+, "The cache performance and optimizations of blocked algorithms," ASPLOS 1991. https://doi.org/10.1145/106972.106981
Bansal+, "Chapter 15 - Fast Matrix Computations on Heterogeneous Streams," in "High Performance Parallelism Pearls", 2015. https://doi.org/10.1016/B978-0-12-803819-2.00011-2
257
Kirk & Hwu, "Chapter 5 - Performance considerations," in "Programming Massively Parallel Processors (Third Edition)", 2017. https://doi.org/10.1016/B978-0-12-811986-0.00005-4
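 Multiply_tiles and Set_to_zero are only named on the slide; below is a minimal sketch (our own) of what Multiply_tiles could look like, assuming float elements, the row-major layout of the A/B/C macros, and tile_dim dividing M, N, and P.

/* C_tile += A_tile x B_tile for one tile_dim x tile_dim tile.
   cC, aA, bB point to the top-left element of each tile; rows of
   C and B are N elements apart, rows of A are P elements apart. */
void Multiply_tiles(float *cC, const float *aA, const float *bB,
                    int tile_dim, int N, int P) {
    for (int i = 0; i < tile_dim; i++)
        for (int k = 0; k < tile_dim; k++) {
            float a = aA[i * P + k];            /* reused across the j loop */
            for (int j = 0; j < tile_dim; j++)
                cC[i * N + j] += a * bB[k * N + j];
        }
}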
Tiled Matrix Multiplication on GPUs

Computer Architecture - Lecture 9: GPUs and GPGPU Programming (Fall 2017) https://youtu.be/mgtlbEqn2dA?t=8157 258
Restructuring Data Layout (I)
 Pointer based traversal (e.g., of a linked list)
 Assume a huge linked list (1B nodes) and unique keys
 Why does this code have poor cache hit rate?
 “Other fields” occupy most of the cache line even though they are rarely accessed!

struct Node {
    struct Node* next;    // frequently accessed
    int key;              // frequently accessed
    char name[256];       // rarely accessed
    char school[256];     // rarely accessed
};

while (node) {
    if (node->key == input_key) {
        // access other fields of node (rarely accessed)
    }
    node = node->next;    // frequently accessed
}

259
Restructuring Data Layout (II)
 Idea: separate rarely-accessed fields of a data structure and pack them into a separate data structure

 Who should do this?
 Programmer
 Compiler
 Profiling vs. dynamic
 Hardware?
 Who can determine what is frequently accessed?

struct Node {
    struct Node* next;
    int key;
    struct Node_data* node_data;
};

struct Node_data {
    char name[256];
    char school[256];
};

while (node) {
    if (node->key == input_key) {
        // access node->node_data
    }
    node = node->next;
}

260
Improving Basic Cache Performance
 Reducing miss rate
 More associativity
 Alternatives/enhancements to associativity
 Victim caches, hashing, pseudo-associativity, skewed associativity
 Better replacement/insertion policies
 Software approaches
 Reducing miss latency/cost
 Multi-level caches
 Critical word first
 Subblocking/sectoring
 Better replacement/insertion policies
 Non-blocking caches (multiple cache misses in parallel)
 Multiple accesses per cycle
 Software approaches

261
Miss Latency/Cost
 What is miss latency or miss cost affected by?

 Where does the miss get serviced from?


 What level of cache in the hierarchy?
 Row hit versus row conflict in DRAM (bank/rank/channel conflict)
 Queueing delays in the memory controller and the interconnect
 Local vs. remote memory (chip, node, rack, remote server, …)
 …

 How much does the miss stall the processor?


 Is it overlapped with other latencies?
 Is the data immediately needed by the processor?
 Is the incoming block going to evict a longer-to-refetch block?
 …
262
Memory Level Parallelism (MLP)

[Timeline figure: miss A is isolated; misses B and C overlap in time (parallel misses)]

 Memory Level Parallelism (MLP) means generating and


servicing multiple memory accesses in parallel [Glew’98]
 Several techniques to improve MLP (e.g., out-of-order execution)
 MLP varies. Some misses are isolated and some parallel
How does this affect cache replacement?
Traditional Cache Replacement Policies
 Traditional cache replacement policies try to reduce miss
count

 Implicit assumption: Reducing miss count reduces memory-


related stall time

 Misses with varying cost/MLP breaks this assumption!

 Eliminating an isolated miss helps performance more than


eliminating a parallel miss
 Eliminating a higher-latency miss could help performance
more than eliminating a lower-latency miss

264
An Example

P4 P3 P2 P1 P1 P2 P3 P4 S1 S2 S3

Misses to blocks P1, P2, P3, P4 can be parallel


Misses to blocks S1, S2, and S3 are isolated

Two replacement algorithms:


1. Minimizes miss count (Belady’s OPT)
2. Reduces isolated miss (MLP-Aware)

For a fully associative cache containing 4 blocks


Fewest Misses = Best Performance

Access sequence: P4 P3 P2 P1  P1 P2 P3 P4  S1 S2 S3
(fully associative cache holding 4 blocks; cache contents evolve differently under the two policies)

Belady’s OPT replacement:   Hit/Miss: H H H M  H H H H  M M M    Misses = 4, Stalls = 4
MLP-Aware replacement:      Hit/Miss: H M M M  H M M M  H H H    Misses = 6, Stalls = 2 (saved stall cycles)
Recommended: MLP-Aware Cache Replacement
 How do we incorporate MLP/cost into replacement decisions?
 How do we design a hybrid cache replacement policy?

 Qureshi et al., “A Case for MLP-Aware Cache Replacement,”


ISCA 2006.

267
Improving Basic Cache Performance
 Reducing miss rate
 More associativity
 Alternatives/enhancements to associativity
 Victim caches, hashing, pseudo-associativity, skewed associativity

 Better replacement/insertion policies


 Software approaches
 …
 Reducing miss latency/cost
 Multi-level caches
 Critical word first
 Subblocking/sectoring
 Better replacement/insertion policies
 Non-blocking caches (multiple cache misses in parallel)
 Multiple accesses per cycle
 Software approaches
 …
268
Lectures on Cache Optimizations (I)

https://www.youtube.com/watch?v=OyomXCHNJDA&list=PL5Q2soXY2Zi9OhoVQBXYFIZywZXCPl4M_&index=3 269
Lectures on Cache Optimizations (II)

https://www.youtube.com/watch?v=55oYBm9cifI&list=PL5Q2soXY2Zi9JXe3ywQMhylk_d5dI-TM7&index=6 270
Lectures on Cache Optimizations (III)

https://www.youtube.com/watch?v=jDHx2K9HxlM&list=PL5PHm2jkkXmi5CxxI7b3JCL1TWybTDtKq&index=21 271
Lectures on Cache Optimizations
 Computer Architecture, Fall 2017, Lecture 3
 Cache Management & Memory Parallelism (ETH, Fall 2017)
 https://www.youtube.com/watch?v=OyomXCHNJDA&list=PL5Q2soXY2Zi9OhoVQBX
YFIZywZXCPl4M_&index=3

 Computer Architecture, Fall 2018, Lecture 4a


 Cache Design (ETH, Fall 2018)
 https://www.youtube.com/watch?v=55oYBm9cifI&list=PL5Q2soXY2Zi9JXe3ywQMh
ylk_d5dI-TM7&index=6

 Computer Architecture, Spring 2015, Lecture 19


 High Performance Caches (CMU, Spring 2015)
 https://www.youtube.com/watch?v=jDHx2K9HxlM&list=PL5PHm2jkkXmi5CxxI7b3J
CL1TWybTDtKq&index=21

https://www.youtube.com/onurmutlulectures 272
Multi-Core Issues in Caching
274

Caches in a Multi-Core System

[Die photo: four cores (CORE 0-3), each with a private L2 cache, a shared L3 cache, and the DRAM interface, DRAM memory controller, and DRAM banks on chip]
Caches in a Multi-Core System

Apple M1,
2021

Source: https://www.anandtech.com/show/16252/mac-mini-apple-m1-tested 275


Caches in a Multi-Core System

Intel Alder Lake,


2021
Source: https://twitter.com/Locuza_/status/1454152714930331652 276
Caches in a Multi-Core System

Core Count:
8 cores/16 threads

L1 Caches:
32 KB per core

L2 Caches:
512 KB per core

L3 Cache:
32 MB shared

AMD Ryzen 5000, 2020


https://wccftech.com/amd-ryzen-5000-zen-3-vermeer-undressed-high-res-die-shots-close-ups-pictured-detailed/ 277
Caches in a Multi-Core System
AMD increases the L3 size of their 8-core Zen 3
processors from 32 MB to 96 MB

Additional 64 MB L3 cache die


stacked on top of the processor die
- Connected using Through Silicon Vias (TSVs)
https://community.microcenter.com/discussion/5
134/comparing-zen-3-to-zen-2
- Total of 96 MB L3 cache

https://youtu.be/gqAYMx34euU 278
https://www.tech-critter.com/amd-keynote-computex-2021/
3D Stacking Technology: Example

https://www.pcgameshardware.de/Ryzen-7-5800X3D-CPU-278064/Specials/3D-V -Cache-Release-1393125/ 279


Caches in a Multi-Core System
IBM POWER10,
2020

Cores:
15-16 cores,
8 threads/core

L2 Caches:
2 MB per core

L3 Cache:
120 MB shared

https://www.it-techblog.de/ibm-power10-prozessor-mehr-speicher-mehr-tempo-mehr-sicherheit/09/2020/ 280
Caches in a Multi-Core System

Cores:
128 Streaming Multiprocessors

L1 Cache or
Scratchpad:
192KB per SM
Can be used as L1 Cache
and/or Scratchpad

L2 Cache:
40 MB shared

Nvidia Ampere, 2020


https://www.tomshardware.com/news/infrared-photographer-photos-nvidia-ga102-ampere-silicon 281
Caches in a Multi-Core System
Nvidia Hopper,
2022

Cores: L1 Cache or L2 Cache:


144 Streaming Scratchpad: 60 MB shared
Multiprocessors 256KB per SM
Can be used as L1 Cache
and/or Scratchpad

https://developer.nvidia.com/blog/nvidia-hopper-architecture-in-depth/ 282
Caches in Multi-Core Systems
 Cache efficiency becomes even more important in a multi-
core/multi-threaded system
 Memory bandwidth is at premium
 Cache space is a limited resource across cores/threads

 How do we design the caches in a multi-core system?

 Many decisions and questions


 Shared vs. private caches
 How to maximize performance of the entire system?
 How to provide QoS & predictable perf. to different threads in a shared cache?
 Should cache management algorithms be aware of threads?
 How should space be allocated to threads in a shared cache?
 Should we store data in compressed format in some caches?
 How do we do better reuse prediction & management in caches?
283
Private vs. Shared Caches
 Private cache: Cache belongs to one core (a shared block
can be in multiple caches)
 Shared cache: Cache is shared by multiple cores

[Figure: Left (private): each of CORE 0-3 has its own L2 cache above the DRAM memory controller. Right (shared): CORE 0-3 share a single L2 cache above the DRAM memory controller]

284
Resource Sharing Concept and Advantages
 Idea: Instead of dedicating a hardware resource to a
hardware context, allow multiple contexts to use it
 Example resources: functional units, pipeline, caches, buses,
memory
 Why?

+ Resource sharing improves utilization/efficiency  throughput


 When a resource is left idle by one thread, another thread can
use it; no need to replicate shared data
+ Reduces communication latency
 For example, data shared between multiple threads can be kept
in the same cache in multithreaded processors
+ Compatible with the shared memory programming model

285
Resource Sharing Disadvantages
 Resource sharing results in contention for resources
 When the resource is not idle, another thread cannot use it
 If space is occupied by one thread, another thread needs to re-
occupy it

- Sometimes reduces each or some thread’s performance


- Thread performance can be worse than when it is run alone
- Eliminates performance isolation  inconsistent performance
across runs
- Thread performance depends on co-executing threads
- Uncontrolled (free-for-all) sharing degrades QoS
- Causes unfairness, starvation

Need to efficiently and fairly utilize shared resources


286
Private vs. Shared Caches
 Private cache: Cache belongs to one core (a shared block
can be in multiple caches)
 Shared cache: Cache is shared by multiple cores

[Figure: Left (private): each of CORE 0-3 has its own L2 cache above the DRAM memory controller. Right (shared): CORE 0-3 share a single L2 cache above the DRAM memory controller]

287
Shared Caches Between Cores
 Advantages:
 High effective capacity
 Dynamic partitioning of available cache space
 No fragmentation due to static partitioning
 If one core does not utilize some space, another core can
 Easier to maintain coherence (a cache block is in a single location)

 Disadvantages
 Slower access (cache not tightly coupled with the core)
 Cores incur conflict misses due to other cores’ accesses
 Misses due to inter-core interference
 Some cores can destroy the hit rate of other cores
 Guaranteeing a minimum level of service (or fairness) to each core is harder
(how much space, how much bandwidth?)

288
Lectures on Multi-Core Cache Management

https://www.youtube.com/watch?v=7_Tqlw8gxOU&list=PL5Q2soXY2Zi9OhoVQBXYFIZywZXCPl4M_&index=17 289
Lectures on Multi-Core Cache Management

https://www.youtube.com/watch?v=c9FhGRB3HoA&list=PL5Q2soXY2Zi9JXe3ywQMhylk_d5dI-TM7&index=29 290
Lectures on Multi-Core Cache Management

https://www.youtube.com/watch?v=Siz86__PD4w&list=PL5Q2soXY2Zi9JXe3ywQMhylk_d5dI-TM7&index=30 291
Lectures on Multi-Core Cache Management
 Computer Architecture, Fall 2018, Lecture 18b
 Multi-Core Cache Management (ETH, Fall 2018)
 https://www.youtube.com/watch?v=c9FhGRB3HoA&list=PL5Q2soXY2Zi9JXe3ywQM
hylk_d5dI-TM7&index=29

 Computer Architecture, Fall 2018, Lecture 19a


 Multi-Core Cache Management II (ETH, Fall 2018)
 https://www.youtube.com/watch?v=Siz86__PD4w&list=PL5Q2soXY2Zi9JXe3ywQM
hylk_d5dI-TM7&index=30

 Computer Architecture, Fall 2017, Lecture 15


 Multi-Core Cache Management (ETH, Fall 2017)
 https://www.youtube.com/watch?v=7_Tqlw8gxOU&list=PL5Q2soXY2Zi9OhoVQBXY
FIZywZXCPl4M_&index=17

https://www.youtube.com/onurmutlulectures 292
Lectures on Memory Resource Management

https://www.youtube.com/watch?v=0nnI807nCkc&list=PL5Q2soXY2Zi9xidyIgBxUz7xRPS-wisBN&index=21 293
Lectures on Memory Resource Management
 Computer Architecture, Fall 2020, Lecture 11a
 Memory Controllers (ETH, Fall 2020)
 https://www.youtube.com/watch?v=TeG773OgiMQ&list=PL5Q2soXY2Zi9xidyIgBxUz
7xRPS-wisBN&index=20
 Computer Architecture, Fall 2020, Lecture 11b
 Memory Interference and QoS (ETH, Fall 2020)
 https://www.youtube.com/watch?v=0nnI807nCkc&list=PL5Q2soXY2Zi9xidyIgBxUz7
xRPS-wisBN&index=21
 Computer Architecture, Fall 2020, Lecture 13
 Memory Interference and QoS II (ETH, Fall 2020)
 https://www.youtube.com/watch?v=Axye9VqQT7w&list=PL5Q2soXY2Zi9xidyIgBxU
z7xRPS-wisBN&index=26
 Computer Architecture, Fall 2020, Lecture 2a
 Memory Performance Attacks (ETH, Fall 2020)
 https://www.youtube.com/watch?v=VJzZbwgBfy8&list=PL5Q2soXY2Zi9xidyIgBxUz7
xRPS-wisBN&index=2

https://www.youtube.com/onurmutlulectures 294
Cache Coherence
Cache Coherence
 Basic question: If multiple processors cache the same
block, how do they ensure they all see a consistent state?

[Figure: P1 and P2 connected via an interconnection network to main memory, where location x holds the value 1000]
The Cache Coherence Problem

[Figure: P2 executes “ld r2, x” and caches the value 1000; main memory still holds x = 1000]
The Cache Coherence Problem

[Figure: P1 also executes “ld r2, x”; both P1 and P2 now cache the value 1000; main memory holds x = 1000]
The Cache Coherence Problem

[Figure: P1 executes “ld r2, x”, “add r1, r2, r4”, “st x, r1”, updating its cached copy to 2000; P2 still caches the stale value 1000; main memory still holds x = 1000]
The Cache Coherence Problem

[Figure: P2 now executes “ld r5, x” and should NOT load the stale value 1000 from its cache, since P1 has written 2000]
A Very Simple Coherence Scheme (VI)
 Idea: All caches “snoop” (observe) each other’s write/read
operations. If a processor writes to a block, all others
invalidate the block.
 A simple protocol:
 Write-through, no-write-allocate cache
 Actions of the local processor on the cache block: PrRd, PrWr
 Actions that are broadcast on the bus for the block: BusRd, BusWr

[State diagram: two states, Valid and Invalid. Valid: PrRd/-- and PrWr/BusWr are self-loops; an observed BusWr moves the block to Invalid. Invalid: PrRd/BusRd moves the block to Valid; PrWr/BusWr stays in Invalid.]
301
Lecture on Cache Coherence

https://www.youtube.com/watch?v=T9WlyezeaII&list=PL5Q2soXY2Zi9xidyIgBxUz7xRPS-wisBN&index=38 302
Lecture on Memory Ordering & Consistency

https://www.youtube.com/watch?v=Suy09mzTbiQ&list=PL5Q2soXY2Zi9xidyIgBxUz7xRPS-wisBN&index=37 303
Lecture on Cache Coherence & Consistency
 Computer Architecture, Fall 2020, Lecture 21
 Cache Coherence (ETH, Fall 2020)
 https://www.youtube.com/watch?v=T9WlyezeaII&list=PL5Q2soXY2Zi9xidyIgBxUz7
xRPS-wisBN&index=38

 Computer Architecture, Fall 2020, Lecture 20


 Memory Ordering & Consistency (ETH, Fall 2020)
 https://www.youtube.com/watch?v=Suy09mzTbiQ&list=PL5Q2soXY2Zi9xidyIgBxUz
7xRPS-wisBN&index=37

 Computer Architecture, Spring 2015, Lecture 28


 Memory Consistency & Cache Coherence (CMU, Spring 2015)
 https://www.youtube.com/watch?v=JfjT1a0vi4E&list=PL5PHm2jkkXmi5CxxI7b3JCL
1TWybTDtKq&index=32

 Computer Architecture, Spring 2015, Lecture 29


 Cache Coherence (CMU, Spring 2015)
 https://www.youtube.com/watch?v=X6DZchnMYcw&list=PL5PHm2jkkXmi5CxxI7b3
JCL1TWybTDtKq&index=33
https://www.youtube.com/onurmutlulectures 304