CS356Unit8 Memory Notes

The document discusses performance metrics in computing, focusing on latency and throughput, and the challenges posed by the memory wall due to the disparity in processor and memory speed advancements. It outlines various strategies for improving performance, including enhancing memory architecture and utilizing principles of locality through caching. Additionally, it covers memory organization, types of RAM, and the implications of memory technology on access times and efficiency.

8.1 CS356 Unit 8: Memory

8.2 Performance Metrics
• Latency: Total time for a _____________ to complete
– Often hard to improve dramatically
– Example: Takes roughly 4 years to get your bachelor's degree
– From perspective of an ______________
• Throughput/Bandwidth: ____________ per operation
– Usually much easier to improve by applying parallelism
– From perspective of the _________________
– Example: A university can graduate more students per year by hiring more instructors or increasing class size

8.3 The Memory Wall
• Problem: The Memory Wall
– Processor speeds have been increasing much faster than memory access speeds (Memory technology targets density rather than speed)
– Large memories yield large address decode and _______ times
– Main memory is physically located on separate chips and sending signals between chips takes a ______________ than on the same chip
[Figure: Processor-Memory Performance Gap. Processor performance improves about 55%/year, memory about 7%/year. Source: Hennessy and Patterson, Computer Architecture: A Quantitative Approach (2003), ©Elsevier Science]

8.4 Options for Improving Performance
• Focus on latency by improving the ______________________
– Can we improve the physical design of the basic memory circuits (i.e. the circuit that remembers a single bit) to create faster RAMs?
  • This is ________________
– Can we integrate memories on the ___________ as our processing logic?
• Focus on _____________ by improving the architecture/organization
– Within a single memory, can we organize it in a more efficient manner to improve throughput
  • DRAM organization, DDR SDRAM, etc.
– Can we use a ______________ of memories to make the most expensive accesses far more rare
  • ________________
– These are generally _________ to do than latency improvements

8.5 Principle of Locality
• Most of the architectural improvements we make will seek to ___________ the Principle of Locality
– Explains why caching with a hierarchy of memories yields improvement gain
• Works in two dimensions
– _________ Locality: If an item is referenced, items whose addresses are _______ will tend to be referenced _____________
  • Examples: ___________ and ______________
– __________ Locality: If an item is referenced, it will tend to be __________________________
  • Examples: _______, repeatedly called __________, setting a variable and then reusing it many times
• 90/10 rule: Analysis shows that usually 10% of the ___________ instructions account for 90% of the _____________ instructions

Program Code
func4:
    movl    (%rdi), %eax
    movl    $1, %edx
    jmp     .L2
.L4:
    movslq  %edx, %rcx
    movl    (%rdi,%rcx,4), %ecx
    cmpl    %ecx, %eax
    jle     .L3
    movl    %ecx, %eax
.L3:
    addl    $1, %edx
.L2:
    cmpl    %esi, %edx
    jl      .L4
    ret

Arrays
Element   Value       Address
data[5]   0000 0002   0x00214
data[4]   0000 0001   0x00210
data[3]   0000 0002   0x0020c
data[2]   0000 0001   0x00208
data[1]   0000 0002   0x00204
data[0]   0000 0001   0x00200

8.6 Memory Hierarchy & Caching
• General approach is to use several levels of faster and faster memory to hide delay of lower levels

Level               Approx. access time
Registers
L1 Cache            ~ 1 ns
L2 Cache            ~ 10 ns
Main Memory         ~ 100 ns
Secondary Storage   ~ 1-10 ms

– Higher levels: Smaller, Faster, More Expensive
– Lower levels: Larger, Slower, Less Expensive
– Units of transfer shown between levels: Word or Byte; Cache block/line of 1-8 words (Take advantage of spatial locality); Page of 4KB-64KB words (Take advantage of spatial locality)
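The effect of array access order on locality can be sketched in C. This is a minimal illustration, not part of the original slides: the array size and the function names `fill`, `sum_row_major`, and `sum_col_major` are assumptions made for the example. Both functions compute the same sum, but the row-major loop touches neighboring addresses on consecutive iterations, while the column-major loop jumps a full row (COLS * sizeof(int) bytes) each time.

```c
#include <assert.h>
#include <stddef.h>

#define ROWS 256
#define COLS 256

/* Fill the array with a[i][j] = i + j so the sum is easy to check. */
void fill(int a[ROWS][COLS]) {
    for (size_t i = 0; i < ROWS; i++)
        for (size_t j = 0; j < COLS; j++)
            a[i][j] = (int)(i + j);
}

/* Inner loop walks consecutive addresses: neighbors in memory are used
   right after each other (the access pattern spatial locality rewards). */
long sum_row_major(int a[ROWS][COLS]) {
    long sum = 0;
    for (size_t i = 0; i < ROWS; i++)
        for (size_t j = 0; j < COLS; j++)
            sum += a[i][j];
    return sum;
}

/* Inner loop strides by one full row per access, so consecutive
   accesses are COLS * sizeof(int) bytes apart. */
long sum_col_major(int a[ROWS][COLS]) {
    long sum = 0;
    for (size_t j = 0; j < COLS; j++)
        for (size_t i = 0; i < ROWS; i++)
            sum += a[i][j];
    return sum;
}
```

Timing the two functions on a large array is one way to observe the difference; the actual gap depends on the machine's cache sizes and is not claimed here.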

8.7 Hierarchy Access Time & Sizes

8.8 MEMORY ORGANIZATION
8.9 Memory Array
• Logical View = 1D array of rows (Dwords or Qwords)
– Already this is 2D because each qword is 64-bits (i.e. (64) 1-bit columns)
• Physical View = 2D array of rows and columns
– Each row may contain 1000’s of columns (bits) though we have to access at least 8- (and often 16-, 32-, or 64-) bits at a time

1D Logical View (Each row is a single qword = 64-bits)
Address   Contents
0x0018    C800 4DB2 2004 1023
0x0010    CC31 5EEF 89AB 97CD
0x0008    2830 FB50 AB49 82FE
0x0000    0001 ACDE 1234 89AB

2D Physical View (e.g. a row is 1KB = 8Kb): rows start at 0x000000, 0x000400, 0x000800, ...

8.10 Translating Addresses to 2D Indices
• While the programmer can keep their view of a linear (1D) address space, the hardware will translate the address into several indices (row, column, etc.) by __________ the address bits into ___________
• Analogy: When you check into a hotel you receive 1 number but portions of the number represent _______ _________ (e.g. 612)
[Figure: Processor connected to memory over the system bus with a 40-bit address (A) and 64-bit data (D); the physical address/data bus width varies and may be smaller/larger. Physical View of Memory: row 0 starts at 0x000000, row 1 at 0x000400, row 2 at 0x000800; columns hold 1-byte values.]
Example: address 0x0000000410
Rank/Bank   Row          Col
0000        0000000001   0000010000   = 0x000410
• Each cell represents an 8-bit byte
• Address broken into fields to identify row/col/etc. (i.e. higher dimension indices)
Hotel example (612):
– Floor: 6
– Aisle: 1
– Room: 2
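The address example above (0x000410 split into Rank/Bank, Row, and Col) can be sketched with shifts and masks. This is a minimal sketch, assuming the 4 + 10 + 10 bit field widths shown in that example; the type `dram_index` and the function `split_address` are hypothetical names introduced only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Field widths assumed from the slide's example (4 + 10 + 10 bits). */
#define COL_BITS  10u
#define ROW_BITS  10u
#define BANK_BITS 4u

typedef struct {
    uint32_t bank; /* rank/bank field (upper bits) */
    uint32_t row;  /* row index (middle bits) */
    uint32_t col;  /* column index (lower bits) */
} dram_index;

/* Slice a linear (1D) address into the (bank, row, col) indices. */
dram_index split_address(uint32_t addr) {
    dram_index idx;
    idx.col  = addr & ((1u << COL_BITS) - 1u);
    idx.row  = (addr >> COL_BITS) & ((1u << ROW_BITS) - 1u);
    idx.bank = (addr >> (COL_BITS + ROW_BITS)) & ((1u << BANK_BITS) - 1u);
    return idx;
}
```

For the slide's address 0x000410 this yields bank 0, row 1, column 16, matching the fields 0000, 0000000001, and 0000010000.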

8.11 DRAM TECHNOLOGIES
Main memory organization

8.12 Memory Chip Organization
• Memory technologies share the same layout but differ in their cell implementation
– ___________
– ___________
• Memories require the row bits to be sent first; they are used to select one row (aka “____ line")
– Uses a hardware component known as a decoder
• All cells in the selected row access their data bits and output them on their respective “___________"
• The column address is sent next and used to select the desired 8 bit lines (i.e. 1 byte)
– Uses a hardware component known as a mux
[Figure: Cell array with word lines WL[0]..WL[1023] and 1K bit lines BL[0]..BL[1023]. For address 0x000410, the 10-bit row address (0000000001) feeds the row address decoder, which selects WL[1]; the column address (0000010000) controls the amplifiers & column mux, which drive Data[7:0] in/out. SRAM and DRAM differ in how each cell is made, but the organization is roughly the same.]

8.13 SRAM vs. DRAM
• Dynamic RAM (DRAM) Cells (store 1 bit)
– Will _____________ if not refreshed periodically every few _______________ [i.e. dynamic]
– Extremely small (_______________ & a capacitor)
  • Means we can have very high density (GB of RAM)
– Small circuits require more time to access the bit
  • _______________
– Used for _________________
• Static RAM (SRAM) Cells (store 1 bit)
– Will retain values as long as _____________ [i.e. static]
– Larger (___ transistors)
– Larger circuitry can access bit FASTER
– Used for __________ memory

8.14 Memory Controller
• DRAMs require a non-trivial hardware controller (aka memory controller)
– To split up the address and send the row and column address at the right time
– To periodically refresh the DRAM cells
– Plus more…
• Used to require a separate chip from the processor
• But due to scaling (i.e. Moore's Law) most processors integrate the controller on-chip
– Helps reduce access time since fewer hops
[Figure: Legacy architectures used separate chipsets for the memory and I/O controller.]
[Figure: Current general-purpose processors usually integrate the memory controller on chip.]

8.15 Implications of Memory Technology
• Memory latency of a single access using current DRAM technology will be slow
• We must improve bandwidth
– Idea 1: Access __________________ a single word at a time (to exploit spatial locality)
– Technology: Fast Page Mode, DDR SDRAM, etc.
– Idea 2: Increase number of accesses serviced in ____________________________
– Technology: Banking

8.16 Legacy DRAM Timing
• Can have only a single access “in-flight” at once
• Memory controller must send row and column address portions for each access
– tRC = Cycle Time (____ ns) = Time before next access __________
– tRAC = Access Time (__ ns) = Time until data is ____
[Figure: Legacy DRAM. /RAS and /CAS drive a timing generator; the row address goes to the row decoder and the column address to the column muxes of the memory array, with data in/out. The timing diagram marks tRC and tRAC. Must present new Row/Column address for each access.]

8.17 Fast Page Mode DRAM Timing
• Can provide _________________ addresses with only one row address
[Figure: Fast Page Mode. /RAS and /CAS drive the timing generator; the row address is held in a register feeding the row decoder, and the column address feeds the column muxes of the memory array, with data in/out. Future addresses that fall in the same row can pull data from the latched row.]

8.18 Synchronous DRAM Timing
• Registers the column address and automatically increments it, accessing n sequential data words in n successive clocks (called _________… n=______ usually)
[Figure: SDRAM (Synchronous DRAM). Same structure with the addition of a clock signal (CLK), a column register/counter, and a column latch/register. Will get up to ‘n’ consecutive words in the next ‘n’ clocks after column address is sent.]

8.19 DDR SDRAM Timing
• Double data rate: access data every _____ clock cycle
[Figure: DDR SDRAM (Double-Data Rate SDRAM). Timing generator driven by /RAS, /CAS, and CLK; row register and row decoder; column register/counter, column latch/register, and column muxes; data in/out. Addition of clock signal. Will get up to ‘2n’ consecutive words in the next ‘n’ clocks after column address is sent.]

8.20 Key Point About Main Memory
• Time to access a sequential chunk of bytes in RAM (main memory) has two components
– Time to find the ______ of a chunk (this is ______)
– Time to access each _________ byte (this is ______)
• Accessing a chunk of N ___________ bytes is far faster than N ____________ bytes
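The two-component access time described in the Key Point slide can be written as a small cost model. This is only a sketch under stated assumptions: the struct `mem_timing`, both function names, and any timing values passed in are hypothetical placeholders, not measurements or datasheet figures. It assumes a sequential chunk pays the start-up cost once, while N unrelated bytes pay it on every access.

```c
#include <assert.h>

/* Hypothetical timing parameters (illustrative only). */
typedef struct {
    double start_ns;    /* time to locate the start of a chunk */
    double per_byte_ns; /* time to transfer each byte once the chunk is open */
} mem_timing;

/* N bytes in one sequential chunk: pay the start-up cost once. */
double sequential_time_ns(mem_timing t, unsigned n) {
    return t.start_ns + n * t.per_byte_ns;
}

/* N bytes at unrelated addresses: pay the start-up cost every time. */
double scattered_time_ns(mem_timing t, unsigned n) {
    return n * (t.start_ns + t.per_byte_ns);
}
```

With the made-up values start_ns = 50 and per_byte_ns = 1, 64 sequential bytes cost 114 ns in this model, while 64 scattered bytes cost 3264 ns.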
8.21 Banking
• Divide memory into “banks” duplicating row/column decoder and other peripheral logic to create _________________ memory arrays that can access data in ___________
– uses a ___________ of the address to determine which bank to access
[Figure: Memory controller (MC) drives the address bus and data bus shared by Bank 0, Bank 1, Bank 2, and Bank 3.]
Example: address 0x004010
Row            Bank   Col
000000000100   00     0000010000   = 0x004010

8.22 Bank Access Timing
• Consecutive accesses to different banks can be __________ and hide the time to access the row and select the column
• Consecutive accesses within a bank (to different rows) _____________ the access latency
[Figure: Timing diagram with CLK, the address bus (Access 1 to Bank 1, Access 2a to Bank 2, Access 2b to Bank 2, each sending a row and then a column), and the data bus (Data 1, Data 2a, Data 2b), marking the delay due to bank conflict. Access 1 maps to bank 1 while access 2a maps to bank 2 allowing parallel access. However, access 2b immediately follows and maps to bank 2 causing a delay.]

8.23 Programming Considerations
• For memory configuration given earlier, accesses to the same bank but different row occur on a 32KB boundary
• Now consider a matrix multiply of 8K x 8K integer matrices (i.e. each row of a matrix occupies 8K x 4 bytes = 32KB)
• In code below…m2[0][0] @ 0x10010000 while m2[1][0] @ 0x10018000

Address      Unused (A31-A29)   Row (A28…A15)        Bank (A14,A13)   Col. (A12…A0)
0x10010000   000                1 0000 0000 0001 0   00               0000000000000
0x10018000   000                1 0000 0000 0001 1   00               0000000000000
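The field layout in the table above (Row = A28…A15, Bank = A14,A13) can be checked in code before reading the matrix loop. This is a minimal sketch assuming exactly that layout; the function names `bank_of`, `row_of`, and `bank_conflict` are hypothetical helpers introduced for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bank field: address bits A14..A13 (2 bits). */
uint32_t bank_of(uint32_t addr) {
    return (addr >> 13) & 0x3u;
}

/* Row field: address bits A28..A15 (14 bits). */
uint32_t row_of(uint32_t addr) {
    return (addr >> 15) & 0x3FFFu;
}

/* Two accesses conflict when they target the same bank but different rows. */
bool bank_conflict(uint32_t a, uint32_t b) {
    return bank_of(a) == bank_of(b) && row_of(a) != row_of(b);
}
```

For m2[0][0] at 0x10010000 and m2[1][0] at 0x10018000 this reports the same bank (0) with different rows, so walking down a column of m2 lands in the same bank on every step.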

int m1[8192][8192], m2[8192][8192], result[8192][8192];
int i,j,k;
...
for(i=0; i < 8192; i++){
  for(j=0; j < 8192; j++){
    result[i][j]=0;
    /* inner loop walks across a row of m1 and down a column of m2 */
    for(k=0; k < 8192; k++){
      result[i][j] += m1[i][k] * m2[k][j];
    }
  }
}

8.24 CACHING
8.25 Cache Overview
• Remember what registers are used for?
– Quick access to copies of data
– Only a few (32 or 64) so that they can be accessed really quickly
– Controlled by the software/compiler
• Cache memory is a small-ish (kilobytes to a few megabytes) "fast" memory usually built onto the processor chip
• Will hold _____________ of the latest data & instructions accessed by the processor
• Managed by the ________
– _____________ to the software
[Figure: Processor chip containing registers (e.g. %rax, %rsp, PC/IP), ALUs, and cache memory, connected by a bus to Memory (RAM).]

8.26 Cache Blocks/Lines
• Whenever the processor generates a read or a write, it will first check the cache memory to see if it contains the desired data
– If so, it can get the data quickly ___________
– Otherwise, it must go to the _______ main memory to get the data (but subsequent accesses can be serviced by the cache)
[Figure: Steps of a cache miss
1. Processor requests word @ 0x400028
2. Cache does not have the data and requests whole cache line 400020-40003f
3. Memory responds
4. Cache forwards desired word
5. Subsequent access to any word in that block can be serviced by the cached copy (fast)]

8.27 Cache Blocks/Lines
• To exploit spatial locality, cache memory is broken into "_______" or "_______"
– Any time data is brought in, it will bring in the entire block of data (to exploit spatial locality)
– Blocks start on addresses __________ of their size
[Figure: Processor, 128B cache [4 blocks (lines) of 8-words (32-bytes)], and main memory (0x400000, 0x400040, ...). Unit of transfer between processor and cache: a word. Unit of transfer between cache and main memory: a block.]

8.28 Cache and Locality
• Caches take advantage of locality
• Spatial Locality
– Caches do not store individual words but blocks of words (a.k.a. "cache line" or "cache block")
– Caches always bring in a block or line of sequential words because if we access one, we are likely to access the next
– Bringing in blocks of sequential words takes advantage of _______________________ (i.e. __________, etc.)
• Temporal Locality
– Leave data in the cache because it will likely be accessed again
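The address range of the block that holds a given word can be computed by masking off the low address bits. This is a minimal sketch, assuming the block size is a power of two (32 bytes in the figures above); `block_base` and `block_end` are hypothetical helper names.

```c
#include <assert.h>
#include <stdint.h>

/* First address of the block containing addr.
   Assumes block_size is a power of two. */
uint32_t block_base(uint32_t addr, uint32_t block_size) {
    assert(block_size != 0 && (block_size & (block_size - 1)) == 0);
    return addr & ~(block_size - 1u);
}

/* Last address of the block containing addr. */
uint32_t block_end(uint32_t addr, uint32_t block_size) {
    return block_base(addr, block_size) + block_size - 1u;
}
```

For the earlier request for the word at 0x400028 with 32-byte blocks, this gives the range 0x400020 through 0x40003f, the same cache line the slide's figure fetches.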
8.29 Examples of Caching Used
• What is caching?
– Maintaining copies of information in locations that are faster to access than their primary home
• Examples
– Data/instruction caches
– TLB
– Branch predictors
– VM
– Web browser
– File I/O (disk cache)
– Internet name resolutions

8.30 IMPLEMENTATION ISSUES

8.31 Cache Definitions
• Cache ______ = Desired data is ____ current level of cache
– Can be further distinguished as read hit vs. write hit
• Cache ______ = Desired data is ____________ in current level
– Can be further distinguished as read miss vs. write miss
• When a cache miss occurs, the new block is brought from the lower level into cache
– If cache is full, a block must be ____________
• When CPU writes to cache, we may use one of two policies:
– Write Through (Store Through): Every write updates both _______ and _________ level of cache to keep them in sync. (i.e. coherent)
– Write Back: Let the CPU keep writing to cache at fast rate, not updating the next level. Only ____________________ to the next level when it needs to be replaced or flushed

8.32 Primary Implementation Issues
• Write Policies
• Replacement algorithms
• Finding cached data (hit/miss)
– Mapping Algorithms
• Coherency (managing multiple versions)
– Discussed in future lectures

8.33 Write Policies
• On a write-hit how should we handle updating the multiple copies that exist (in cache and main memory)?
• Options:
– Update both
– Update 1 now and 1 at the end
Analogy: A movie star who changes their mind about what to eat for lunch, and the assistant who has to communicate with the chef
[Figure: Processor performs "Write word (hit)" to the cache, which sits above main memory (0x400000, 0x400040, ...).]

8.34 Write Through Cache
• Write-through option:
– Update both levels of hierarchy
– Depending on hardware implementation, higher-level may have to _______ for write to complete to lower level
– Later when block is evicted, ___ _______________ is needed
– _______ writes require multiple main memory ________
Key Idea: Communicate EVERY change to main memory as they happen (keeps both copies in sync)
[Figure: Steps
1. Write word (hit)
2. Cache and memory copies are updated
3. On eviction, no writeback is needed]

8.35 Write Back Cache
• Write-back option:
– Update ______ cached copy
– Processor can continue quickly
– Later when block is evicted, _______ block is written back (because bookkeeping is kept on a per block basis)
– Notice that multiple writes only require ___ writeback upon eviction
Key Idea: Communicate ONLY the FINAL version of a block to main memory (when the block is evicted)
[Figure: Processor performs "Write word (hit)"; cache updates value & signals processor to continue; on eviction, entire block written back.]

8.36 Write-through vs. Writeback
• Write-through
– Pros: Keep both versions in sync at all times
– Cons: Poor performance if next level of hierarchy is slow (see virtual memory) or if many, repeated accesses
• Writeback
– Pros: Fast if many repeated accesses
– Cons:
  • Coherency issues
  • Slow if few, isolated writes since entire block must be written back
• In practice
– Writeback must be used for lower levels of hierarchy where the next level is extremely slow
– Even at higher levels writeback is often used (Most Intel L1 caches are writeback)
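The trade-off between the two write policies can be made concrete by counting transfers to the next level. The following is a toy sketch, not a model of any real cache: it assumes a cache holding a single 32-byte block, allocates the block on a write miss, and uses hypothetical names (`one_block_cache`, `cache_write`, `cache_flush`). Note that a write-through transfer is one word while a write-back transfer is one whole block.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define BLOCK_BYTES 32u

typedef struct {
    bool valid, dirty;
    uint32_t block;      /* number of the block currently cached */
    unsigned mem_writes; /* transfers sent to the next level */
} one_block_cache;

/* Evict the cached block; only a dirty block costs a writeback. */
void cache_flush(one_block_cache *c) {
    if (c->valid && c->dirty)
        c->mem_writes++;
    c->valid = false;
    c->dirty = false;
}

/* Perform one write under the chosen policy. */
void cache_write(one_block_cache *c, uint32_t addr, bool write_back) {
    uint32_t block = addr / BLOCK_BYTES;
    if (!c->valid || c->block != block) { /* write miss: replace the block */
        cache_flush(c);
        c->valid = true;
        c->block = block;
    }
    if (write_back)
        c->dirty = true;  /* defer the update until eviction */
    else
        c->mem_writes++;  /* write-through: update the next level now */
}
```

Eight writes to words in the same block cost eight next-level writes under write-through but a single block writeback (at flush time) under write-back, which is the "many repeated accesses" case in the comparison above.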
8.37 Replacement Policies
• On a read- or write-miss, a new block must be brought in
• This requires evicting a current block residing in the cache
• Replacement policies
– ______
– ______
– ______

8.38 MAPPINGS

8.39 Cache Question
[Figure: A block of cache data (00 0a 56 c4 81 e0 fa ee / 39 bf 53 e1 b8 00 ff 22) asks: "Hi, I'm a block of cache data. Can you tell me what address I came from?" Guesses: 0xbfffeff0? 0x0080a1c4?]

8.40 Cache Implementation
• Assume a cache of 4 blocks of 16-bytes each
• Must store more than just data!
• What other bookkeeping and identification info is needed?
– Has the block been ____________
– Is the block ____________ or _____________
– ___________________ of the data: Where did I come from?
[Figure: Memory / RAM drawn as 16-byte blocks (0x000-0x00f, 0x010-0x01f, 0x020-0x02f, ..., 0x420-0x42f, ..., 0x7a0-0x7af, ...), each holding a184 beef 0781 8821 / 5621 930c e400 cc33.]
Cache contents:
Block   Address range   State
0       0x7c0-0x7cf     Valid, Modified
1       0x470-0x47f     Valid, Unmodified
2       Empty           -
3       Empty           -

8.41 Implementation Terminology
• What bookkeeping values must be stored with the cache in addition to the block data?
• _____ – Portion of the block’s address range used to identify the MM block residing in the cache from other MM blocks.
• ______ bit – Indicates the block is occupied with valid data (i.e. not empty or invalid)
• ______ bit – Indicates the cache and MM copies are “inconsistent” (i.e. a write has been done to the cached copy but not the main memory copy)
– Used for ________________ caches

8.42 Identifying Blocks via Address Range
• Possible methods
– Store start and end address (requires multiple comparisons)
– Ensure block ranges sit on binary boundaries (upper address bits identify the block with a single value)
• Analogy: Hotel room layout/addressing
– 1st Digit = Floor
– 2nd Digit = Aisle
– 3rd Digit = Room w/in aisle
– To refer to the range of rooms on the second floor, left aisle we would just say rooms 20x
[Figure: Analogy: Hotel Rooms. 1st floor aisles hold rooms 100-109 and 120-129; 2nd floor aisles hold rooms 200-209 and 220-229.]

4 word (16-byte) blocks:
Addr. Range   Binary
000-00f       0000 0000 0000 - 1111
010-01f       0000 0001 0000 - 1111

8 word (32-byte) blocks:
Addr. Range   Binary
000-01f       0000 000 00000 - 11111
020-03f       0000 001 00000 - 11111

8.43 Cache Implementation
• Assume 12-bit addresses and 16-byte blocks
• Block addresses will range from xx0-xxF
– Address can be broken down as follows
– A[11:4] = Tag = Identifies block range (i.e. xx0-xxF)
– A[3:0] = Byte offset within the cache block
• Examples
– Addr. = 0x124: Byte 4 w/in block 120-12F
– Addr. = 0xACC: Byte 12 w/in block AC0-ACF

8.44 Cache Implementation
• To identify which MM block resides in each cache block, the tags need to be stored along with the "dirty/modified" and "valid" bits
Cache contents:
Tag           V      D     Block
T=_________   V=__   D=0   0x7c0-7cf
T=_________   V=__   D=1   0x470-47f
T=0111 1100   V=__   D=0   Empty
T=0000 0000   V=__   D=0   Empty
[Figure: Memory / RAM drawn as 16-byte blocks (0x000-0x00f, 0x010-0x01f, 0x020-0x02f, ..., 0x470-0x47f, ..., 0x7c0-0x7cf, ...).]
8.45 Scenario
• You lost your keys
• You think back to where you have been lately
– You've been to the library, to class, to grab food at campus center, and the gym
– Where do you have to look to find your keys?
• If you had been home all day and discovered your keys were missing, where would you have to look?
• Key lesson: If something can be anywhere you have to search _________
– By contrast, if we limit where things can be then our search need only look in those ________________ places

8.46 Content-Addressable Memory
• Cache memory is one form of what is known as “content-addressable” memory
– This means data can be in any location in memory and does not have one particular address
– Additional information is saved with the data and is used to “address”/find the desired data (this is the “tag” in this case) via a search on each access
– This search can be very ____________________________!!
[Figure: The processor core issues Read 0x47c, and the cache asks of each entry "Is block 0x470-0x47f here? or here? or here? or here?"]
Cache contents:
Tag           V     D     Block
T=0111 1100   V=1   D=0   0x7c0-7cf
T=0100 0111   V=1   D=1   0x470-47f
T=0100 0111   V=0   D=0   Empty
T=0000 0000   V=0   D=0   Empty

8.47 Tag Comparison
• When caches have many blocks (> 16 or 32) it can be expensive (hardware-wise) to check all tags
• When a block can be ___________ you have to search ___________.
[Figure: For Address = A[11:0], Tag = A[11:4] and Byte Offset = A[3:0]. The tag is compared (=) against the stored tag of every cache entry, each comparison result is ANDed (&) with that entry's valid bit, and the results are ORed together to produce HIT/MISS.]

8.48 Tag Comparison Example
• Tag portion of desired address is checked against all the tags and qualified with the valid bits to determine a hit
[Figure: Address = 0x47c = 0100 0111 1100, so Tag = A[11:4] = 0100 0111 and Byte Offset A[3:0] = 0xc. The entry with T=0111 1100, V=1 does not match (0). The entry with T=0100 0111, V=1, D=1 (0x470-47f) matches (1). The two empty entries have V=0 and contribute 0. The OR of the results signals a HIT.]
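The compare, AND-with-valid, then OR structure of the tag comparison can be sketched as a loop in software. This is a minimal illustration assuming the 12-bit address, 16-byte block layout used on these slides (Tag = A[11:4]); hardware performs all comparisons in parallel, and `cache_entry` and `lookup` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NUM_BLOCKS 4

typedef struct {
    bool valid;
    bool dirty;
    uint8_t tag; /* A[11:4] of the block held in this entry */
} cache_entry;

/* Return the index of the entry that hits, or -1 on a miss.
   A tag match only counts when the entry's valid bit is set. */
int lookup(const cache_entry cache[NUM_BLOCKS], uint16_t addr) {
    uint8_t tag = (uint8_t)((addr >> 4) & 0xFFu);
    for (int i = 0; i < NUM_BLOCKS; i++) {
        if (cache[i].valid && cache[i].tag == tag)
            return i;
    }
    return -1;
}
```

With the cache contents from the example (tags 0111 1100 and 0100 0111 valid, the other two entries invalid), a read of 0x47c hits in entry 1, while an address whose tag matches only an invalid entry is a miss.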
8.49 Mapping Techniques
• Determines where blocks can be __________ in the cache
• By reducing number of possible MM blocks that map to a cache block, hit logic (searches) can be done faster
• 3 Primary Methods
– ____________ Mapping
– _______ Associative Mapping
– _______-Associative Mapping

8.50 Fully Associative Mapping
• Any block from memory can be put in any cache block (i.e. no restriction)
– Implies we have to search _____________ to determine hit or miss
Cache contents:
Entry         Tag           V     D
Cache Blk 0   T=0111 1100   V=1   D=0
Cache Blk 1   T=0100 0111   V=1   D=1
Cache Blk 2   T=0100 0111   V=0   D=0
Cache Blk 3   T=0000 0000   V=0   D=0
[Figure: Memory / RAM drawn as 16-byte blocks (0x000-0x00f, 0x010-0x01f, 0x020-0x02f, ..., 0x420-0x42f, ..., 0x7a0-0x7af, ...).]

8.51 Direct Mapping
• Each block from memory can only be put in one location
• Given n cache blocks, MM block i maps to cache block i mod n
[Figure: MM Blk 0 (0x000-0x00f) through MM Blk 6 (0x060-0x06f), each labeled "= _ mod 4", beside Cache Blk 0 through Cache Blk 3.]

8.52 K-way Set-Associative Mapping
• Given S sets, block i of MM maps to set i mod S
• Within the set, block can be put anywhere
• Given N=total cache blocks, let K = number of cache blocks per set = ____
– ___ comparisons required for search
[Figure: MM Blk 0 (0x000-0x00f) through MM Blk 6 (0x060-0x06f), each labeled "= _ mod 2", beside a cache where Cache Blk 0 and Cache Blk 1 form Set 0 and Cache Blk 2 and Cache Blk 3 form Set 1.]
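The two placement rules above (cache block i mod n for direct mapping, set i mod S for set-associative mapping) can be sketched directly. This is a minimal illustration; the function names are hypothetical, and the block size is passed in rather than assumed.

```c
#include <assert.h>
#include <stdint.h>

/* Main-memory block number i for a byte address. */
uint32_t mm_block(uint32_t addr, uint32_t block_bytes) {
    return addr / block_bytes;
}

/* Direct mapping: MM block i maps to cache block i mod n. */
uint32_t direct_mapped_block(uint32_t addr, uint32_t block_bytes,
                             uint32_t n_blocks) {
    return mm_block(addr, block_bytes) % n_blocks;
}

/* Set-associative mapping: MM block i maps to set i mod S. */
uint32_t set_index(uint32_t addr, uint32_t block_bytes, uint32_t n_sets) {
    return mm_block(addr, block_bytes) % n_sets;
}
```

With 16-byte blocks, address 0x040 is MM block 4, which maps to cache block 0 of a 4-block direct-mapped cache and to set 0 of a 2-set cache; address 0x050 (MM block 5) maps to cache block 1 and set 1.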
8.53 Fully Associative Implementation
• Assume 12 address bits
Field    Size                                             Purpose
Offset   B=16 bytes per block, ____________ offset bits   Determines byte/word within the block
Tag      Remaining bits                                   Identifies the MM address from where the block came
Example: Write 0x084 = 0000 1000 0100 (Tag | Offset = 0000 1000 | 0100)
[Figure: Cache Blk 0 (T=0111 1100, V=1, D=0), Cache Blk 1 (T=0100 0111, V=1, D=1), Cache Blk 2 (T=0100 0111, V=0, D=0), Cache Blk 3 (T=0000 0000, V=0, D=0) beside Memory / RAM blocks 0x000-0x00f, 0x010-0x01f, 0x020-0x02f, ..., 0x080-0x08f, ..., 0xff0-0xfff.]

8.54 Fully Associative Address Scheme
• Byte offset bits = ________ bits (B=Block Size)
• Tag = Remaining bits

8.55 Fully Associative Mapping
• Any block from memory can be put in any cache block (i.e. no mapping scheme)
• Completely flexible
Access 0x004: Tag | Offset = 00000000 | 0100
[Figure: All four cache blocks (Cache Blk 0 through Cache Blk 3) start with T=0000 0000 and V=0, beside Memory / RAM blocks 0x000-0x00f through 0xff0-0xfff.]

8.56 Fully Associative Mapping
Write 0x004: Tag | Offset = 0000 0000 | 0100
Block 0 can go in any empty cache block, but let’s just pick cache block 2
[Figure: Cache Blk 2 now holds 0x000-0x00f with T=0000 0000 and V=1; Cache Blk 0, 1, and 3 remain invalid (V=0).]
8.57 Fully Associative Mapping
Access        Tag | Offset
Write 0x004   0000 0000 | 0100
Read 0x018    0000 0001 | 1000
Read 0xfe0    1111 1110 | 0000
Read 0xffc    1111 1111 | 1100
Blocks can go anywhere so the next 3 accesses will prefer to fill in empty blocks
Cache contents:
Tag           V     D     Block
T=1111 1110   V=1   D=0   0xfe0-0xfef
T=1111 1111   V=1   D=0   0xff0-0xfff
T=0000 0000   V=1   D=1   0x000-0x00f
T=0000 0001   V=1   D=0   0x010-0x01f

8.58 Fully Associative Mapping
Access        Tag | Offset
Write 0x004   0000 0000 | 0100
Read 0x018    0000 0001 | 1000
Read 0xfe0    1111 1110 | 0000
Read 0xffc    1111 1111 | 1100
Read 0xfc4    1111 1100 | 0100
Now cache is full so when we access a new block (0xfc0-0xfcf) we have to evict a block from cache. Let us pick the Least Recently Used (LRU). Since it is dirty/modified we must write 0x000-0x00f back to MM
Cache contents:
Tag           V     D     Block
T=1111 1110   V=1   D=0   0xfe0-0xfef
T=1111 1111   V=1   D=0   0xff0-0xfff
T=1111 1100   V=1   D=0   0xfc0-0xfcf
T=0000 0001   V=1   D=0   0x010-0x01f
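The access sequence above can be replayed with a small fully associative cache model. This is a sketch under stated assumptions: 4 blocks of 16 bytes, write-back with allocation on a miss, and LRU tracked with a simple access counter. The names `fa_line`, `fa_cache`, and `fa_access` are hypothetical. Unlike the slides, which arbitrarily place the first block in cache block 2, this sketch fills the first empty entry, so entry indices differ while hits, misses, and writebacks do not.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NUM_BLOCKS  4
#define OFFSET_BITS 4 /* 16-byte blocks */

typedef struct {
    bool valid, dirty;
    uint32_t tag;
    unsigned last_used; /* value of the access counter at the last use */
} fa_line;

typedef struct {
    fa_line line[NUM_BLOCKS];
    unsigned clock;      /* incremented on every access */
    unsigned misses;
    unsigned writebacks; /* dirty blocks written back to MM on eviction */
} fa_cache;

/* Access one address; returns true on a hit. */
bool fa_access(fa_cache *c, uint32_t addr, bool is_write) {
    uint32_t tag = addr >> OFFSET_BITS;
    int victim = 0;
    c->clock++;
    /* Search every entry, since a block may reside anywhere. */
    for (int i = 0; i < NUM_BLOCKS; i++) {
        if (c->line[i].valid && c->line[i].tag == tag) {
            c->line[i].last_used = c->clock;
            if (is_write)
                c->line[i].dirty = true;
            return true;
        }
    }
    /* Miss: prefer an empty entry, otherwise evict the LRU entry. */
    c->misses++;
    for (int i = 0; i < NUM_BLOCKS; i++) {
        if (!c->line[i].valid) { victim = i; break; }
        if (c->line[i].last_used < c->line[victim].last_used)
            victim = i;
    }
    if (c->line[victim].valid && c->line[victim].dirty)
        c->writebacks++; /* modified block must go back to MM */
    c->line[victim] = (fa_line){ true, is_write, tag, c->clock };
    return false;
}
```

Replaying Write 0x004, Read 0x018, Read 0xfe0, Read 0xffc, Read 0xfc4 produces five misses and one writeback: the final read evicts the least recently used block (0x000-0x00f), which is dirty because of the initial write.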

8.59 8.60

Direct Mapping

• Each block from memory can only be put in one location
• Given N total cache blocks, MM block i maps to cache block _________

[Figure: a 4-block cache (Cache Blk 0-3, each with a tag, V bit, and D bit) beside Memory / RAM. MM Blk 0 (0x000-0x00f) = 0 mod 4, MM Blk 1 (0x010-0x01f) = 1 mod 4, MM Blk 2 (0x020-0x02f) = 2 mod 4, MM Blk 3 (0x030-0x03f) = 3 mod 4, MM Blk 4 (0x040-0x04f) = 0 mod 4, MM Blk 5 (0x050-0x05f) = 1 mod 4, MM Blk 6 (0x060-0x06f) = 2 mod 4, ...]

Direct Mapping Address Scheme

• Byte offset bits = log2B bits (B=Block Size)
• Block bits = _________ (N=# of Cache Blocks)
• Tag = Remaining bits

Direct Mapping Implementation

• Assume 12 address bits
• Address fields: Tag | Block | Offset
  – Offset: B=16 bytes per block, log2B = 4 offset bits. Determines byte/word within the block
  – Block: N=4 blocks in the cache, log2N = 2 block bits. Performs hash function (i mod N)
  – Tag: Remaining bits. Identifies blocks that map to the same bucket (block 0, 4, 8, …)

                Tag       Block   Offset
  Write 0x084   0000 10   00      0100

[Figure: Cache and Memory / RAM before and after the write. Afterwards Cache Blk 0 holds T=0000 10, V=1, D=1, block 0x080-0x08f (MM Blk 08 = 0 mod 4); Cache Blks 1-3 are still invalid (V=0).]

Block 0x080-0x08f hashes/maps to cache block 0 and thus must be placed there.

Direct Mapping Implementation

• Assume 12 address bits
• Address fields: Tag | Block | Offset
  – Offset: B=16 bytes per block, log2B = 4 offset bits. Determines byte/word within the block
  – Block: N=4 blocks in the cache, log2N = 2 block bits. Performs hash function (i mod N)
  – Tag: Remaining bits. Identifies blocks that map to the same bucket (block 0, 4, 8, …)

                Tag       Block   Offset
  Write 0x084   0000 10   00      0100
  Read  0x09c   0000 10   01      1100
  Read  0x0b8   0000 10   11      1000
  Read  0x0c8   0000 11   00      1000

[Figure: Cache and Memory / RAM. After the first three accesses: Cache Blk 0 = T=0000 10, V=1, D=1, 0x080-0x08f; Cache Blk 1 = T=0000 10, V=1, D=0, 0x090-0x09f; Cache Blk 2 invalid (V=0); Cache Blk 3 = T=0000 10, V=1, D=0, 0x0b0-0x0bf. After Read 0x0c8: Cache Blk 0 = T=0000 11, V=1, 0x0c0-0x0cf.]

Other blocks must be placed where they hash, which is computed by simply using the block bits.

Even though cache block 2 is open, block 0x0c0-0x0cf must be placed in cache block 0, replacing the previous block.

K-way Set-Associative Mapping

• Given S sets, block i of MM maps to set i mod S
• Within the set, the block can be put anywhere
• Given N=total cache blocks, let K = number of cache blocks per set = N/S
  – K comparisons required for search

[Figure: a 4-block cache organized as 2 sets (Set 0 = Cache Blk 0-1, Set 1 = Cache Blk 2-3) beside Memory / RAM. MM Blk 0 (0x000-0x00f) = 0 mod 2, MM Blk 1 (0x010-0x01f) = 1 mod 2, MM Blk 2 (0x020-0x02f) = 0 mod 2, MM Blk 3 (0x030-0x03f) = 1 mod 2, MM Blk 4 (0x040-0x04f) = 0 mod 2, MM Blk 5 (0x050-0x05f) = 1 mod 2, MM Blk 6 (0x060-0x06f) = 0 mod 2, ...]

K-Way Set Associative Address Scheme

• Byte offset bits = log2B bits (B=Block Size)
• Set bits = ________ bits (S=# of Cache Sets)
• Tag = Remaining bits

K-way Set-Associative Mapping

• Assume 12-bit addresses
• Address fields: Tag | Set | Offset
  – Offset: B=16 bytes per block, log2B = 4 offset bits. Determines byte/word within the block
  – Set: S=N/K=2 sets, log2S = 1 set bit. Performs hash function (i mod S)
  – Tag: Remaining bits. Identifies blocks that map to the same bucket (block 0x00, …, 0x08, 0x0a, 0x0c, …)

                Tag        Set   Offset
  Write 0x084   0000 100   0     0100

[Figure: Cache (Set 0 = Cache Blk 0-1, Set 1 = Cache Blk 2-3) and Memory / RAM before and after the write. Afterwards Cache Blk 0 holds T=0000 100, V=1, D=1, block 0x080-0x08f; the other cache blocks are still invalid (V=0).]

Block 0x080-0x08f maps to set 0 but can be placed anywhere in that set (i.e. cache block 0 or 1). We'll just choose cache block 0.

K-way Set-Associative Mapping

• Assume 12-bit addresses
• Address fields: Tag | Set | Offset
  – Offset: B=16 bytes per block, log2B = 4 offset bits. Determines byte/word within the block
  – Set: S=N/K=2 sets, log2S = 1 set bit. Performs hash function (i mod S)
  – Tag: Remaining bits. Identifies blocks that map to the same bucket (block 0x00, …, 0x08, 0x0a, 0x0c, …)

                Tag        Set   Offset
  Write 0x084   0000 100   0     0100
  Read  0x0b0   0000 101   1     0000
  Read  0x0c8   0000 110   0     1000
  Read  0x0a4   0000 101   0     0100

[Figure: Cache and Memory / RAM. After the first three accesses: Set 0 holds T=0000 100, V=1, D=1, 0x080-0x08f and T=0000 110, V=1, D=0, 0x0c0-0x0cf; Set 1 holds T=0000 101, V=1, D=0, 0x0b0-0x0bf and one invalid block. After Read 0x0a4: the first block of Set 0 holds T=0000 101, V=1, D=0, 0x0a0-0x0af.]

Block 0x0b0-0x0bf maps to set 1 and can be placed in either of the blocks of set 1. Block 0x0c0-0x0cf maps to set 0 and will be placed in the remaining free block of set 0.

Block 0x0a0-0x0af hashes to set 0, which is full. We'll pick the LRU block (0x080-0x08f), which requires a writeback. Then we can bring in 0x0a0-0x0af.

Summary of Mapping Schemes

• Fully associative
  – Most flexible ______________
  – _________ search time ____
  – MM address fields (bits 31..0): Tag | Offset
  – No hashing…can be placed anywhere in cache. Must search N locations.
• Direct-mapped cache
  – Least flexible (more evictions)
  – ________ search time ____
  – MM address fields (bits 31..0): Tag | Block | Offset
  – h(a) = block field. Only search 1 location.
• K-way Set Associative mapping
  – Compromise
    • 1-way set associative = ________
    • N-way set associative = ________
  – Work to search is ___ [k is usually small enough to be done in parallel => O(_)]
  – MM address fields (bits 31..0): Tag | Set | Offset
  – h(a) = set field. Only search k locations.

Address Mapping Examples

• 16-bit addresses, 2 KB cache, 32 bytes/block
• Find address mapping for:
  – Fully Associative
  – Direct Mapping
  – 4-way Set Associative
  – 8-way Set Associative

Address Mapping Examples

• First find parameters:
  – B = Block size
  – N = Cache blocks
  – S = Sets for 4-way and 8-way
• B is given as 32 bytes/block
• N depends on cache size and block size
  – N =
• S for 4-way & 8-way
  – S4-way = N/k =
  – S8-way = N/k =

Fully Associative

Parameters: B = 32, N = 64, S4-way = 16, S8-way = 8

• Offset =
• Tag =

  15 _ _ 0
  Tag | Offset

Direct Mapping

Parameters: B = 32, N = 64, S4-way = 16, S8-way = 8

• Offset =
• Block =
• Tag =

  15 __ __ __ __ 0
  Tag | Block | Offset

4-Way Set Assoc. Mapping

Parameters: B = 32, N = 64, S4-way = 16, S8-way = 8

• Offset =
• Set =
• Tag =

  15 _ _ _ _ 0
  Tag | Set | Offset

8-Way Set Assoc. Mapping

Parameters: B = 32, N = 64, S4-way = 16, S8-way = 8

• Offset =
• Set =
• Tag =

  15 _ _ _ _ 0
  Tag | Set | Offset

Cache Operation Example

• Address Trace
  – R: 0x00a0
  – W: 0x00f4
  – R: 0x00b0
  – W: 0x2a2c
• Operations
  – Hit
  – Fetch block XX
  – Evict block XX (w/ or w/o WB)
  – Final WB of block XX
• Perform address breakdown and apply address trace
• 2-Way Set-Assoc, N=4, B=32 bytes/block

  Address   Tag   Set   Byte Offset
  0x00a0
  0x00f4
  0x00b0
  0x2a2c

  Processor Access   Cache Operation
  R: 0x00a0
  W: 0x00f4
  R: 0x00b0
  W: 0x2a2c
  Done!

ADDING MULTIPLE LEVELS OF CACHE

More of a Good Thing

• If one cache was good, more is likely better
  – Add a Level 2 and even Level 3 cache
  – Each is slightly larger, but slower

[Figure: memory hierarchy pyramid. Higher levels are smaller, faster, and more expensive; lower levels are larger, slower, and less expensive.]

  Registers
      Unit of Transfer: Word or Byte
  L1 Cache (~ 1 ns)
  L2 Cache (~ 10 ns)
      Unit of Transfer: Cache block/line, 1-8 words (Take advantage of spatial locality)
  Main Memory (~ 100 ns)
      Unit of Transfer: Page, 4KB-64KB words (Take advantage of spatial locality)
  Secondary Storage (~ 1-10 ms)

Principle of Inclusion

• When the cache at level i misses on data that is stored in level k (i < k), the data is brought into all levels j where ___________
• This implies that lower levels always contain a _____ of higher levels
• Example:
  – L1 contains most recently used data
  – L2 contains that data + data used earlier
  – MM contains all data
• This makes coherence far easier to maintain between levels

[Figure: Processor, L1 Cache Memory, L2 Cache Memory, Main Memory]

Average Access Time

• Define parameters
  – Hi = Hit Rate of Cache Level Li (Note that 1-Hi = Miss rate)
  – Ti = Access time of level i
  – Ri = Burst rate per word of level i (after startup access time)
  – B = Block Size
• Let us find TAVE = average access time

Tave without L2 cache

• 2 possible cases:
  – Either we have a hit and pay only the L1 cache hit time
  – Or we have a miss and read in the whole block to L1 and then read from L1 to the processor
• Tave = T1 + (1-H1)•[TMM + B•RMM]
  – The second term is (Miss Rate)*(Miss Penalty)
• For T1=10ns, H1 = 0.9, B=8, TMM=100ns, RMM=25ns
  – Tave = 10 + [ (0.1) • (100+8•25) ] = 40 ns

Tave with L2 cache

• 3 possible cases:
  – Either we have a hit and pay the L1 cache hit time
  – Or we miss L1 but hit L2 and read in the block from L2
  – Or we miss L1 and L2 and read in the block from MM
• Tave = ______________________________________________
• For T1 = 10ns, H1 = 0.9, T2 = 20ns, R2 = 10ns, H2 = 0.98, B=8, TMM=100ns, RMM=25 ns
• Tave =

Intel Nehalem Quad Core

UNDERSTANDING MISSES

Miss Rate

• Reducing Miss Rate means lower TAVE
• To analyze miss rate, categorize misses based on why they occur
  – ____________________ Misses
    • ____________ to a block will always result in a miss
  – ____________________ Misses
    • Misses because the cache is ____________
  – ____________________ Misses
    • Misses due to ________________ (replacement of direct or set associative)

Miss Rate & Block Size

• Block size too small: Not getting ___________________ to next higher level
• Block size too large: Time is spent getting data you __________ and that data occupies space in the cache that prevents other useful data from being present

[Figure: miss rate vs. block size. Graph used courtesy “Computer Architecture: AQA, 3rd ed.”, Hennessy and Patterson]

Hit/Miss Rate vs. Cache Size

• Capacity is important up to a point
  – Only the data the program is currently working with (aka its "working set") need fit in the cache

[Figure: hit/miss rate vs. cache size. OS:PP 2nd Ed.: Fig. 9.4]

Miss Rate & Associativity

• At reasonable cache sizes, associativity above ________-way does not yield much improvement

[Figure: miss rate vs. associativity. Graph used courtesy “Computer Architecture: AQA, 3rd ed.”, Hennessy and Patterson]

Prefetching
• Hardware Prefetching
– On miss of block i, fetch block ______________
• Software Prefetching
– Special “_____________” Instructions
– Compiler inserts these instructions to give hints ahead of
time as to the upcoming access pattern

CACHE CONSCIOUS PROGRAMMING

What Makes a Cache Work

• What are the necessary conditions
  – Locations used to store cached data must be faster to access than original locations
  – Some reasonable amount of ___________
  – Access patterns must be somewhat ___________

Working Sets

• Generally a program works with different sets of data at different times
  – Consider an image processing algorithm akin to JPEG encoding
    • Perform data transformation on image pixels using several weighting tables/arrays
    • Create a table of frequencies
    • Perform compression coding using that table of frequencies
    • Replace pixels with compressed codes
• The data that the program is accessing in a small time window is referred to as its working set
• We want that working set to fit in cache and make as much reuse of that working set as possible while it is in cache
  – Example of performing JPG compression:
    • Keep weight tables in cache when performing data transformation
    • Keep frequency table in cache when compressing

Cache-Conscious Programming

• Order of array indexing
  – Row major vs. column major ordering
• Blocking (keeps working set small)
• Pointer-chasing
  – Linked lists, graphs, tree data structures that use pointers do not exhibit good spatial locality
• General Principles
  – Keep working set reasonably ____ (temporal locality)
  – Use small ______ (spatial locality)
  – Static structures usually better than dynamic ones

Example of row vs. column major ordering (two alternative loop bodies):

    // Row-major: the inner loop walks along a row
    for(i=0; i<SIZE; i++) {
      for(j=0; j<SIZE; j++) {
        A[i][j] = A[i][j]*2;
      }
    }

    // Column-major: the inner loop walks down a column
    for(i=0; i<SIZE; i++) {
      for(j=0; j<SIZE; j++) {
        A[j][i] = A[j][i]*2;
      }
    }

[Figures: Memory Layout of matrix A (Row Major vs. Col. Major); Original Matrix vs. Blocked Matrix; Memory Layout of Linked List]

https://cartesianproduct.wordpress.com/tag/working-set/

Blocked Matrix Multiply

• Traditional working set
  – 1 row of C, 1 row of A, NxN matrix B
• Break NxN matrix into smaller BxB matrices
  – Perform matrix multiply on blocks
  – Sum results of block multiplies to produce overall multiply result
• Blocked multiply working set
  – Three BxB matrices

    for(i = 0; i < N; i+=B) {
      for(j = 0; j < N; j+=B) {
        for(k = 0; k < N; k+=B) {
          for(ii = i; ii < i+B; ii++) {
            for(jj = j; jj < j+B; jj++) {
              for(kk = k; kk < k+B; kk++) {
                Cb[ii][jj] += Ab[ii][kk] * Bb[kk][jj];
    } } } } } }

[Figures: Traditional Multiply (C = A * B); Blocked Multiply (each BxB block of C accumulates the products of BxB blocks of A and B)]

Blocked Multiply Results

• Intel Nehalem processor
  – L1D = 32 KB, L2 = 256KB, L3 = 8 MB

[Figure: bar chart "Blocked Matrix Multiply (N=2048)", Time (sec) vs. Block Dimension (B). Reported times range from about 97 sec down to about 12 sec.]
