CS356Unit8 Memory Notes
Performance Metrics
• Latency: Total time for a _____________ to complete
– Often hard to improve dramatically
– Example: Takes roughly 4 years to get your bachelor's degree
– From perspective of an ______________
MEMORY ORGANIZATION
– Already this is 2D because each qword is 64 bits (i.e. 64 1-bit columns)
• Physical View = 2D array of rows and columns
– Each row may contain 1000's of columns (bits), though we have to access at least 8 (and often 16, 32, or 64) bits at a time
• So that the processor and software can keep their view of a linear (1D) address space, the hardware will translate the address into several indices (row, column, etc.) by __________ the address bits into ___________
• Analogy: When you check into a hotel you receive 1 number, but portions of the number represent _______ _________ (e.g. 612)
– Floor: 6
– Aisle: 1
– Room: 2
[Figure: 1D Logical View (each row is a single qword = 64 bits)]
Address   Data
...       ...
0x0018    C800 4DB2 2004 1023
0x0010    CC31 5EEF 89AB 97CD
0x0008    2830 FB50 AB49 82FE
0x0000    0001 ACDE 1234 89AB
[Figure: 2D Physical View (e.g. a row is 1KB = 8Kb); row 0 starts at 0x000000, row 1 at 0x000400, row 2 at 0x000800]
[Figure: Physical View of Memory; the processor's address (A) and data (D) lines reach a grid of 1-byte values organized as row 0, row 1, row 2, ...]
• Each cell represents an 8-bit byte
• Address broken into fields to identify row/col/etc. (i.e. higher-dimension indices)
Rank/Bank  Row         Col
0000       0000000001  0000010000  = 0x000410
Main memory organization
• The row address bits select one row (aka "____ line")
– Uses a hardware component known as a decoder
• All cells in the selected row access their data bits and output them on their respective "___________"
• SRAM and DRAM differ in how each cell is made, but the organization is roughly the same
[Figure: address 0x000410 split into Row Addr = 0000000001 and Col = 0000010000; the row address decoder drives one word line (here WL[1], out of lines up to WL[1023]) across a grid of cells]
DRAM TECHNOLOGIES
tRAC = Access Time (__ns) = Time until data is ____
tRC
– Technology: Fast Page Mode, DDR SDRAM, etc.
[Figure: Fast Page Mode; Row Address Reg. feeds the Row Decoder, Column Address Reg. and Column Latch/Register feed the Column Muxes, which drive Data in / out]
Fast Page Mode
(Future addresses that fall in the same row can pull data from the latched row)
[Figure: SDRAM; Row Address Reg. feeds the Row Decoder, Column Address Reg/Cntr feeds the Column Muxes, which drive Data in / out]
SDRAM (Synchronous DRAM)
Addition of clock signal. Will get up to 'n' consecutive words in the next 'n' clocks after column address is sent
• Accessing a chunk of N ___________ bytes is far
[Figure: DDR SDRAM; Row Address Reg., Column Address, Column Muxes, Data in / out]
DDR SDRAM (Double-Data Rate SDRAM)
Addition of clock signal. Will get up to ‘2n’ consecutive
words in the next ‘n’ clocks after column address is sent
[Figure: memory controller (MC) connected by an address bus and a data bus to multiple banks]
Row           Bank  Col
000000000100  00    0000010000  = 0x004010
[Timing diagram: CLK; the address bus carries the row and column addresses for Access 1 (bank 1), Access 2a (bank 2), and Access 2b (bank 2); the data bus returns Data 1, Data 2a, and Data 2b, with a "Delay due to bank conflict" before Data 2b]
Access 1 maps to bank 1 while access 2a maps to bank 2, allowing parallel access. However, access 2b immediately follows and maps to bank 2, causing a delay.
Programming Considerations
• For the memory configuration given earlier, accesses to the same bank but a different row occur on a 32KB boundary
• Now consider a matrix multiply of 8K x 8K integer matrices (i.e. each row of 8K 4-byte integers occupies 32KB)
• In code below…m2[0][0] @ 0x10010000 while m2[1][0] @ 0x10018000
– Controlled by the software/compiler
Cache Memory
• Cache memory is a small-ish (kilobytes to a few megabytes) "fast" memory usually built onto the processor chip
• Will hold _____________ of the latest data & instructions accessed by the processor
• Managed by the ________
– _____________ to the software
... contains the desired data
– If so, it can get the data quickly ___________
– Otherwise, it must go to the _______ main memory to get the data (but subsequent accesses can be serviced by the cache)
[Figure: processor, cache, bus, and Memory (RAM) with blocks at 0x400000, 0x400040, 0x400080, 0x4000c0, 0x400100, 0x400140]
1 Request word @ 0x400028
2 Cache does not have the data and requests whole cache line 400020-40003f
3 Memory responds
4 Cache forwards desired word
5 Subsequent access to any word in that block can be serviced by the cached copy (fast)
Replacement Policies
• On a read- or write-miss, a new block must be
brought in
• This requires evicting a current block residing
in the cache
• Replacement policies
– ______
– ______
– ______
MAPPINGS
• ______ bit – Indicates the cache and MM copies are "inconsistent" (i.e. a write has been done to the cached copy but not the main memory copy)
– Used for ________________ caches
Analogy: Hotel Rooms
[Figure: hotel floor plan; 1st Floor rooms 104-109 and 124-129, 2nd Floor rooms 204-209 and 224-229]
To refer to the range of rooms on the second floor, left aisle we would just say rooms 20x
8 word (32-byte) blocks:
Addr. Range   Binary
000-01f       0000 000 00000 - 11111
020-03f       0000 001 00000 - 11111
When a block can be ___________ you have to search ___________.
[Figure: cache blocks with tag (T), valid (V), and dirty (D) fields, e.g. T=0000 0000, V=0, D=0 (Empty); each tag is compared (=) with the address and the result is ANDed (&) with V]
... that map to a cache block, hit logic (searches) can be done faster
• 3 Primary Methods
– ____________ Mapping
– _______ Associative Mapping
[Figure: main memory in 16-byte blocks (0x000-0x00f, 0x010-0x01f, 0x020-0x02f, ..., 0x420-0x42f) next to a cache: Cache Blk 0 with T=0111 1100, V=1, D=0; Cache Blk 1 with T=0100 0111, V=1, D=1; Cache Blk 2 with T=0100 0111, V=0, D=0]
• Any block from memory can be put in any cache
[Figure: main memory 0x000-0xfff in 16-byte blocks (0x000-0x00f, 0x010-0x01f, ..., 0xff0-0xfff) next to a 4-block cache (Cache Blk 0-3), each with tag (T), valid (V), and dirty (D) fields]
Access 0x004: Tag = 00000000, Offset = 0100
Block 0 can go in any empty cache block, but let's just pick cache block 2
[Figure: after the access, Cache Blk 2 has V=1, D=0 and holds 0x000-0x00f]
Write 0x004: 0000 0000 0100
[Figure: main memory blocks and the cache block each maps to: MM Blk 08 (0x080-0x08f) = 0 mod 4, MM Blk 09 (0x090-0x09f) = 1 mod 4, MM Blk 0a (0x0a0-0x0af) = 2 mod 4, MM Blk 0b (0x0b0-0x0bf) = 3 mod 4, MM Blk 0c (0x0c0-0x0cf) = 0 mod 4; next to a 4-block cache (Cache Blk 0-3) with tag (T), valid (V), and dirty (D) fields; before the access Cache Blk 0 has T=1111 10, V=0, D=0]
Address = 080
Block 0x080-0x08f hashes/maps to cache block 0 and thus must be placed there
[Figure: after the access, Cache Blk 0 has T=0000 10, V=1, D=1 and holds 0x080-0x08f]
– S_8-way = N/k =
8.75 8.76
Address breakdown (bits 15 ... 0): Tag | Set | Offset
Cache operations to choose from:
– Hit
– Fetch block XX
– Evict block XX (w/ or w/o WB)
– Final WB of block XX
Processor Access    Cache Operation
R: 0x00a0
W: 0x00f4
R: 0x00b0
W: 0x2a2c
Done!
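ADDING MULTIPLE LEVELS OF CACHE
[Figure: memory hierarchy, from higher levels (top) to lower levels (bottom); lower levels are larger, slower, and less expensive]
– Registers
– L1 Cache: ~ 1ns
– L2 Cache: ~ 10ns
– Unit of Transfer between cache and main memory: Cache block/line, 1-8 words (Take advantage of spatial locality)
– Main Memory: ~ 100 ns
– Unit of Transfer between main memory and secondary storage: Page, 4KB-64KB words (Take advantage of spatial locality)
– Secondary Storage: ~1-10 ms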
UNDERSTANDING MISSES
– ____________________ Misses
• ____________ to a block will always result in a miss
– ____________________ Misses
• Misses because the cache is ____________
– ____________________ Misses
• Misses due to ________________ (replacement of
direct or set associative)
Graph used courtesy "Computer Architecture: A Quantitative Approach, 3rd ed.", Hennessy and Patterson
Prefetching
• Hardware Prefetching
– On miss of block i, fetch block ______________
• Software Prefetching
– Special “_____________” Instructions
– Compiler inserts these instructions to give hints ahead of
time as to the upcoming access pattern
CACHE CONSCIOUS PROGRAMMING
Cache-Conscious Programming
• Order of array indexing
– Row major vs. column major ordering
• Blocking (keeps working set small)
• Pointer-chasing
– Linked lists, graphs, tree data structures that use pointers do not exhibit good spatial locality
• General Principles
– Keep working set reasonably ____ (temporal locality)
– Use small ______ (spatial locality)
– Static structures usually better than dynamic ones
Example of row vs. column major ordering:
for(i=0; i<SIZE; i++) {
  for(j=0; j<SIZE; j++) {
    // Row-major
    A[i][j] = A[i][j]*2;
    // Column-major
    A[j][i] = A[j][i]*2;
  }
}
[Figure: Memory Layout of matrix A, traversed in Row Major vs. Col. Major order]
[Figure: Original Matrix vs. Blocked Matrix]
[Figure: Linked Lists; Memory Layout of Linked List]
https://cartesianproduct.wordpress.com/tag/working-set/
Blocked Multiply
– Three BxB matrices
for(i = 0; i < N; i+=B) {
  for(j = 0; j < N; j+=B) {
    for(k = 0; k < N; k+=B) {
      for(ii = i; ii < i+B; ii++) {
        for(jj = j; jj < j+B; jj++) {
          for(kk = k; kk < k+B; kk++) {
            Cb[ii][jj] += Ab[ii][kk] * Bb[kk][jj];
} } } } } }
[Figure: C += A * B computed one BxB block at a time]
[Plot: Time (sec), axis 0-60, vs. Block Dimension (B), axis 1-10; data labels shown: 25.6, 18.9, 18.8, 17.37, 13.27, 12.1, 18.78]