Advanced Computer Architecture
Memory Hierarchy Design
Course 5MD00
Henk Corporaal
November 2013
[email protected]
• Material:
  – book of Hennessy & Patterson
  – Appendix B + Chapter 2 (sections 2.1-2.6)
Cache
[Figure: memory hierarchy: CPU with a register file (32 registers x 4 bytes = 128 bytes), a 1 MB cache memory, and 4 GB main memory]
Memory Hierarchy
[Figure: direct-mapped cache: memory blocks map to cache lines (tag + data) by index, exploiting the principle of locality; an 8-entry cache is indexed by 3 address bits (000-111), and a 1024-entry cache (indices 0-1023) holds a 20-bit tag and 32 bits of data per entry]
[Figure: direct-mapped cache with 4K entries and 4-word (128-bit) blocks: the 32-bit address splits into a 16-bit tag, 12-bit index, 2-bit block offset, and 2-bit byte offset (tag + index form the block address); each entry holds a valid bit, a 16-bit tag, and 128 bits of data; a 4-to-1 mux selects the requested 32-bit word, and a tag match asserts Hit. The decomposition is sketched in code below.]
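To make the address split concrete, a minimal C sketch assuming the 16/12/2/2 bit layout of the figure above (the field names are illustrative):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint32_t addr      = 0x12345678;            /* example address */
    uint32_t byte_off  =  addr        & 0x3;    /* bits  1..0 : byte within word  */
    uint32_t block_off = (addr >> 2)  & 0x3;    /* bits  3..2 : word within block */
    uint32_t index     = (addr >> 4)  & 0xFFF;  /* bits 15..4 : one of 4K entries */
    uint32_t tag       =  addr >> 16;           /* bits 31..16: 16-bit tag        */
    printf("tag=%x index=%x block=%x byte=%x\n",
           (unsigned)tag, (unsigned)index, (unsigned)block_off, (unsigned)byte_off);
    return 0;
}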
[Plot: miss rate (0-25%) versus block size (16, 32, 64, 128 bytes) for cache sizes 1K, 4K, 16K, 64K, and 256K: larger blocks first reduce the miss rate, but very large blocks increase it again for small caches]
• Disadvantages (of larger caches):
– longer hit time (may determine processor cycle time!!)
– higher cost
– access requires more energy
• Example
  – suppose a direct-mapped cache with 128 entries, 4 words/entry
  – size is 128 x 16 = 2K bytes
  – many addresses map to the same entry, e.g.
    • byte addresses 0-15, 2K-2K+15, 4K-4K+15, etc. all map to entry 0
  – what if a program repeatedly accesses (in a loop) the following 3 addresses: 0, 2K+4, and 4K+12?
  – they all miss every iteration, although only 3 words of the cache are really used! (sketched in code below)
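A minimal C sketch of this pathological pattern, assuming the 2K-byte direct-mapped cache above (CACHE_BYTES and the iteration count are illustrative):

#include <stdint.h>

#define CACHE_BYTES 2048                  /* 128 entries x 16 bytes */

volatile uint8_t mem[3 * CACHE_BYTES + 16];

int main(void)
{
    uint8_t sum = 0;
    for (long i = 0; i < 1000000; i++) {
        /* All three addresses map to entry 0 of a direct-mapped
           cache, so each load evicts the block the next one needs. */
        sum += mem[0];                       /* address 0      */
        sum += mem[CACHE_BYTES + 4];         /* address 2K+4   */
        sum += mem[2 * CACHE_BYTES + 12];    /* address 4K+12  */
    }
    return sum;
}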
[Figure: 4-way set-associative cache with 256 sets (index 0-255): each set holds 4 blocks (ways 0-3), each with a valid bit, 22-bit tag, and 32-bit data; the 8-bit index selects a set, all 4 tags are compared in parallel, and a 4-to-1 multiplexor selects the hitting way's data]
4 ways: a set contains 4 blocks
A fully associative cache contains 1 set, containing all blocks
Example 1: cache calculations
• Assume
  – cache of 4K blocks
  – 4-word block size
  – 32-bit address
• Direct mapped (associativity = 1):
  – 16 bytes per block = 2^4, so 4 bits for the byte offset
  – 32-bit address: 32 - 4 = 28 bits for index and tag
  – #sets = #blocks / associativity = 4K, and log2(4K) = 12, so 12 bits for index
  – total number of tag bits: (28 - 12) x 4K = 64 Kbits
• 2-way set associative
  – #sets = #blocks / associativity = 2K sets
  – 1 bit less for indexing, 1 bit more for tag
  – tag bits: (28 - 11) x 2 x 2K = 68 Kbits
• 4-way set associative
  – #sets = #blocks / associativity = 1K sets
  – again 1 bit less for indexing, 1 bit more for tag
  – tag bits: (28 - 10) x 4 x 1K = 72 Kbits
(a small code sketch below reproduces these numbers)
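These calculations generalize. A minimal C helper that reproduces the three numbers above (function and parameter names are illustrative):

#include <stdio.h>

/* Total tag storage in bits for a cache with the given number of
   blocks, associativity, block size, and address width. */
static unsigned tag_bits_total(unsigned blocks, unsigned assoc,
                               unsigned block_bytes, unsigned addr_bits)
{
    unsigned offset_bits = 0, index_bits = 0;
    for (unsigned b = block_bytes; b > 1; b >>= 1) offset_bits++;    /* log2(block size) */
    for (unsigned s = blocks / assoc; s > 1; s >>= 1) index_bits++;  /* log2(#sets)      */
    unsigned tag_bits = addr_bits - offset_bits - index_bits;
    return tag_bits * blocks;                 /* one tag per block */
}

int main(void)
{
    printf("direct mapped: %u Kbits\n", tag_bits_total(4096, 1, 16, 32) / 1024);  /* 64 */
    printf("2-way:         %u Kbits\n", tag_bits_total(4096, 2, 16, 32) / 1024);  /* 68 */
    printf("4-way:         %u Kbits\n", tag_bits_total(4096, 4, 16, 32) / 1024);  /* 72 */
    return 0;
}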
Example 2: cache mapping
• 3 caches, each consisting of 4 one-word blocks: direct mapped, 2-way set associative, and fully associative
• The 3 Cs:
  – Compulsory: the first access to a block is always a miss; also called cold-start misses
    • misses you would have even in an infinite cache
  – Capacity: the cache cannot contain all the blocks the program needs, so blocks are discarded and later retrieved
    • misses you would still have in a fully associative cache of this size
  – Conflict: too many blocks map to the same set, so a block is discarded and later retrieved; also called collision misses
    • the remaining misses, removed by higher associativity
[Plots: miss rate per type (Compulsory, Capacity, Conflict) versus cache size (1-128 KB) for 1-, 2-, 4-, and 8-way set-associative caches; one plot shows the absolute miss rate (0-0.14), the other the relative breakdown (0-100%). Conflict misses shrink with associativity, capacity misses with cache size; compulsory misses stay small.]
Improving Cache Performance
1. Reduce the miss rate
2. Reduce the miss penalty
3. Reduce the time to hit in the cache
• Definitions:
  – Local miss rate: misses in this cache divided by the total number of memory accesses to this cache (Miss rate_L2)
  – Global miss rate: misses in this cache divided by the total number of memory accesses generated by the CPU (Miss rate_L1 x Miss rate_L2)
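A numeric example (numbers chosen for illustration, not from the slides): if 4% of all CPU accesses miss in L1, and half of those also miss in L2, the local miss rate of L2 is 50%, while its global miss rate is 0.04 x 0.5 = 2%.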
4. Second Level Cache (L2)
• Suppose a processor with a base CPI of 1.0
• Clock rate of 500 MHz
• Main memory access time: 200 ns
• Miss rate per instruction in the primary cache: 5%
What is the improvement with a second-level cache that has a 20 ns access time and reduces the miss rate to memory to 2%?
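A worked solution, following the standard miss-penalty calculation (the answer itself did not survive extraction, so treat this as a sketch):
  – cycle time = 1 / 500 MHz = 2 ns, so the main memory miss penalty is 200 ns / 2 ns = 100 cycles
  – without L2: CPI = 1.0 + 5% x 100 = 6.0
  – with L2: the L2 access penalty is 20 ns / 2 ns = 10 cycles, so CPI = 1.0 + 5% x 10 + 2% x 100 = 1.0 + 0.5 + 2.0 = 3.5
  – speedup = 6.0 / 3.5 ≈ 1.7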
Hit Time
[Figure: way prediction in the primary instruction cache: the fetch logic (PC, +0x4 incrementer, jump/branch control) predicts which way holds the next instruction, choosing between a sequential way and a branch-target way; a sample instruction trace with branches (BR) illustrates the predictions]
[Figure: write timing with no write buffering versus with write buffering]
• Blocking
  – instead of accessing entire rows or columns, subdivide matrices into blocks
  – requires more memory accesses but improves locality of the accesses (see the sketch after the example below)
Example: array X is traversed row by row, in row-major order:
  for (row = 0; row < 5000; row++)
      for (col = 0; col < 99; col++)     /* col+1 stays within the 100 columns */
          X[row][col] = X[row][col+1];
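A minimal sketch of blocking (tiling) applied to matrix multiply, assuming N x N matrices and a blocking factor B that divides N (all names and sizes are illustrative, not from the slides):

#define N 512
#define B 32                        /* blocking factor, tuned to the cache size */

static double A[N][N], Bm[N][N], C[N][N];

/* Blocked matrix multiply: each B x B tile of A, Bm, and C is
   reused while it is still resident in the cache. */
void matmul_blocked(void)
{
    for (int ii = 0; ii < N; ii += B)
        for (int jj = 0; jj < N; jj += B)
            for (int kk = 0; kk < N; kk += B)
                for (int i = ii; i < ii + B; i++)
                    for (int j = jj; j < jj + B; j++) {
                        double sum = C[i][j];
                        for (int k = kk; k < kk + B; k++)
                            sum += A[i][k] * Bm[k][j];
                        C[i][j] = sum;
                    }
}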
[Plot: miss rate (0-0.15) versus blocking factor (0-150)]
Summary of Compiler Optimizations to Reduce Cache Misses (by hand)
[Bar chart: performance improvement (1x to 3x) for vpenta (nasa7), gmty (nasa7), tomcatv, btrix (nasa7), mxm (nasa7), spice, cholesky (nasa7), and compress]
• Strided prefetch
  – if the observed sequence of accesses is to blocks b, b+N, b+2N, then prefetch b+3N, etc. (sketched below)
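A minimal C sketch of the stride-detection idea behind such prefetchers, assuming a small reference-prediction table indexed by the load's PC (all names and sizes are illustrative):

#include <stdint.h>

/* One entry of a hypothetical reference prediction table. */
static struct rpt_entry {
    uint64_t last_addr;   /* address of the previous access by this load */
    int64_t  stride;      /* last observed stride                        */
    int      confident;   /* same nonzero stride seen twice in a row?    */
} rpt[256];

/* Called on every load; returns an address worth prefetching,
   or 0 when no stable stride has been observed yet. */
uint64_t on_load(uint64_t pc, uint64_t addr)
{
    struct rpt_entry *e = &rpt[(pc >> 2) & 255];
    int64_t stride = (int64_t)(addr - e->last_addr);
    e->confident = (stride != 0 && stride == e->stride);
    e->stride    = stride;
    e->last_addr = addr;
    /* After accesses b, b+N, b+2N this returns b+3N. */
    return e->confident ? addr + (uint64_t)stride : 0;
}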
[Figure: Pentium 4 pre-fetching]
Issues in HW Prefetching
• Usefulness: prefetches should produce hits
  – if you are unlucky, the prefetched data/instructions are never needed
• Timeliness: not too late and not too early
• Cache and bandwidth pollution
[Figure: CPU with register file, split L1 instruction and data caches, and a unified L2 cache holding the prefetched data]
• Register prefetch
  – loads data into a register
• Cache prefetch
  – loads data into the cache (sketched below)
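A minimal cache-prefetch sketch in C using the GCC/Clang __builtin_prefetch intrinsic; the prefetch distance of 16 elements is an illustrative guess, not from the slides:

#define LEN 100000

static double a[LEN], b[LEN];

double dot(void)
{
    double sum = 0.0;
    for (int i = 0; i < LEN; i++) {
        /* Non-binding hints to fetch data ~16 iterations ahead;
           rw = 0 (read), locality = 1 (low temporal reuse). */
        int p = (i + 16 < LEN) ? i + 16 : LEN - 1;
        __builtin_prefetch(&a[p], 0, 1);
        __builtin_prefetch(&b[p], 0, 1);
        sum += a[i] * b[i];
    }
    return sum;
}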
• DRAM
  – must be re-written after being read
  – must also be periodically refreshed
    • every ~8 ms
    • all bits in a row can be refreshed simultaneously
  – one transistor per bit
  – address lines are multiplexed (sketched below):
    • upper half of the address: row access strobe (RAS)
    • lower half of the address: column access strobe (CAS)
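A toy sketch of the multiplexed addressing, assuming a hypothetical device with 2^12 rows and 2^12 columns (the 24-bit address and the even split are illustrative):

#include <stdint.h>

/* Split a 24-bit DRAM address into the row sent with RAS and
   the column sent with CAS, for a hypothetical 4K x 4K array. */
void split_dram_addr(uint32_t addr, uint32_t *row, uint32_t *col)
{
    *row = (addr >> 12) & 0xFFF;   /* upper half: row access strobe (RAS)    */
    *col =  addr        & 0xFFF;   /* lower half: column access strobe (CAS) */
}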
• Some optimizations:
  – multiple accesses to the same row (reusing the open row buffer)
  – synchronous DRAM (SDRAM)
    • adds a clock to the DRAM interface
    • burst mode with critical word first
  – wider interfaces
  – double data rate (DDR): data transferred on both clock edges
  – multiple banks on each DRAM device
[Figure: internal DRAM organization: an address register feeds a row decoder that selects a row of the memory array; the sense amplifiers form a row buffer; a column decoder, driven by the LS address bits, selects the data output; four banks (1-4), each with its own row buffer, are shown]
• Multiple arrays are organized as different banks
  – typical numbers of banks are 4, 8, and 16
• Sense amplifiers (the row buffer) raise the voltage level on the bitlines to read the data out
• Role of architecture:
– Provide user mode and supervisor mode
– Protect certain aspects of CPU state
– Provide mechanisms for switching
between user and supervisor mode
– Provide mechanisms to limit memory
accesses
• read-only pages
• executable pages
• shared pages
– Provide TLB to translate addresses
Memory organization
• The operating system, together with the MMU hardware, takes care of separating the programs.
• Each program runs in its own 'virtual' environment, using logical addresses that are (often) different from the actual physical addresses.
[Figure: virtual memory operation: the CPU issues a logical (virtual) address; the process table / memory manager checks whether the requested address is 'in core'; if yes, the physical address goes to the cache and main memory (organized in 2K blocks); if no, either an access violation is raised or the missing 2K block is loaded from the swap file on disk. Each program thinks that it owns all the memory; pages not in memory carry disk addresses.]
• Advantages:
  – illusion of having more physical memory
  – program relocation
  – protection
Pages: virtual memory blocks
• Page faults: the data is not in memory, so retrieve it from disk
  – huge miss penalty, thus pages should be fairly large (e.g., 4 KB)
  – reducing page faults is important (LRU replacement is worth the price)
  – the faults can be handled in software instead of hardware
  – using write-through is too expensive, so we use write-back
[Figure: address translation via the page table: the 32-bit virtual address (bits 31-0) is split into a 20-bit virtual page number and a 12-bit page offset; the virtual page number indexes the page table, whose entries hold a valid bit and an 18-bit physical page number; if the valid bit is 0, the page is not present in memory and must come from disk storage; the physical page number plus the unchanged page offset form the physical address]
Solution
• For a 40-bit virtual address space with 4 KB pages and 4-byte entries:
  Size = #entries x size-of-entry = (2^40 / 2^12) x 4 bytes = 1 GByte
• Reduce the size by:
  – dynamic allocation of page table entries
  – hashing: an inverted page table (sketched below)
    • 1 entry per available physical page instead of per virtual page
  – paging the page table itself (i.e., part of it can be on disk)
  – using a larger page size (or multiple page sizes)
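A minimal sketch of the inverted page table idea: one entry per physical page frame, found by hashing the virtual page number (the structure, hash, and sizes are illustrative):

#include <stdint.h>

#define NFRAMES 4096                  /* number of physical page frames */

static struct ipt_entry {
    uint32_t vpn;                     /* virtual page number mapped to this frame */
    int      valid;
} ipt[NFRAMES];

/* Returns the physical frame holding virtual page vpn,
   or -1 on a page fault; linear probing resolves collisions. */
int translate(uint32_t vpn)
{
    uint32_t h = (vpn * 2654435761u) % NFRAMES;    /* simple multiplicative hash */
    for (int probe = 0; probe < NFRAMES; probe++) {
        uint32_t i = (h + probe) % NFRAMES;
        if (!ipt[i].valid)     return -1;          /* empty slot: not mapped, page fault */
        if (ipt[i].vpn == vpn) return (int)i;      /* frame number is the table index    */
    }
    return -1;                                     /* table full, vpn not present */
}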
[Figure: page table with valid bits: the virtual page number indexes the page table; valid entries hold a physical page number pointing into physical memory, while invalid entries hold a disk address pointing into disk storage]
[Flowchart: memory access after the TLB access. Write: if the page is write-protected, raise a write-protection exception; otherwise write the data into the cache, update the tag, and put the data and the address into the write buffer. Read: on a cache hit, deliver the data to the CPU; on a cache miss, stall.]
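A pseudocode-style C rendering of this decision flow, assuming a write-through cache with a write buffer; all helper functions are illustrative stubs, not a real API:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Illustrative stubs standing in for the hardware actions. */
static bool     page_writable = true;    /* pretend TLB protection bit */
static bool     in_cache      = false;   /* pretend cache state        */
static uint32_t cache_data    = 0;

static bool tlb_writable(uint32_t pa)              { (void)pa; return page_writable; }
static bool cache_lookup(uint32_t pa, uint32_t *d) { (void)pa; *d = cache_data; return in_cache; }
static void cache_write(uint32_t pa, uint32_t d)   { (void)pa; cache_data = d; in_cache = true; }
static void write_buffer_put(uint32_t pa, uint32_t d) { printf("write buffer: [%x] = %u\n", (unsigned)pa, d); }
static void write_protection_exception(void)       { printf("write protection exception\n"); }
static void miss_stall_and_fill(uint32_t pa)       { (void)pa; in_cache = true; }

/* The decision flow from the flowchart, after a successful TLB access. */
static uint32_t mem_access(uint32_t paddr, bool is_write, uint32_t wdata)
{
    uint32_t data = 0;
    if (is_write) {
        if (!tlb_writable(paddr)) {
            write_protection_exception();
            return 0;
        }
        cache_write(paddr, wdata);       /* write into cache, update the tag */
        write_buffer_put(paddr, wdata);  /* queue the write to memory        */
        return 0;
    }
    if (!cache_lookup(paddr, &data)) {   /* cache miss: stall until filled */
        miss_stall_and_fill(paddr);
        cache_lookup(paddr, &data);
    }
    return data;                         /* deliver data to the CPU */
}

int main(void)
{
    mem_access(0x1000, true, 7);
    printf("read: %u\n", mem_access(0x1000, false, 0));
    return 0;
}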