
Advanced Computer

Architecture
Memory Hierarchy Design

Course 5MD00

Henk Corporaal
November 2013
[email protected]

Advanced Computer Architecture pg 1


Welcome!
This lecture:
• Memory Hierarchy Design
– Hierarchy
– Recap of Caching (App B)
– Many Cache and Memory Hierarchy Optimizations
– VM: virtual memory support
– ARM Cortex-A8 and Intel Core i7 examples

• Material: book of Hennessy & Patterson
  – Appendix B
  – plus Chapter 2: sections 2.1-2.6

Advanced Computer Architecture pg 2


Registers vs. Memory
• Operands of arithmetic instructions must be registers;
  only 32 registers are provided (why?)
• Compiler associates variables with registers
• Question: what to do about programs with lots of variables?

[Figure: three-level hierarchy]
– CPU register file: 32 x 4 = 128 bytes, fast (2000 MHz)
– Cache: 1 MB, slower (500 MHz)
– Main memory: 4 GB, slowest (133 MHz)
Advanced Computer Architecture pg 3
Memory Hierarchy

Advanced Computer Architecture pg 4


Why does a small cache still work?
• LOCALITY
  – Temporal: you are likely to access the same address again soon
  – Spatial: you are likely to access another address close to the
    current one in the near future

Advanced Computer Architecture pg 5


Memory Performance Gap

Advanced Computer Architecture pg 6


Memory Hierarchy Design
• Memory hierarchy design becomes more crucial
with recent multi-core processors:
– Aggregate peak bandwidth grows with # cores:
  • Intel Core i7 can generate two data references per core per clock
  • With four cores and a 3.2 GHz clock:
    – 25.6 billion 64-bit data references/second +
    – 12.8 billion 128-bit instruction references/second
    – = 409.6 GB/s!
  • DRAM bandwidth is only 6% of this (25 GB/s)
– Requires:
• Multi-port, pipelined caches
• Two levels of cache per core
• Shared third-level cache on chip

Advanced Computer Architecture pg 7


Memory Hierarchy Basics

• Note that speculative and multithreaded processors may execute
  other instructions during a miss
  – Reduces performance impact of misses

Advanced Computer Architecture pg 8


Cache operation
[Figure: the cache (higher level) holds blocks/lines copied from
memory (lower level); each cache line stores a tag plus data]

Advanced Computer Architecture pg 9


Direct Mapped Cache
• Mapping: address is modulo the number of blocks in the cache

[Figure: 8-entry direct-mapped cache (entries 000-111); memory
addresses 00001, 01001, 10001, 11001 all map to entry 001, addresses
00101, 01101, 10101, 11101 to entry 101, and so on]
Advanced Computer Architecture pg 10


Review: Four Questions for Memory
Hierarchy Designers
• Q1: Where can a block be placed in the upper
level? (Block placement)
– Fully Associative, Set Associative, Direct Mapped
• Q2: How is a block found if it is in the upper
level?
(Block identification)
– Tag/Block
• Q3: Which block should be replaced on a miss?
(Block replacement)
– Random, FIFO, LRU
• Q4: What happens on a write?
(Write strategy)
– Write Back or Write Through (with Write Buffer)
Advanced Computer Architecture pg 11
Direct Mapped Cache
Address (bit positions): 31-12 tag (20 bits), 11-2 index (10 bits),
1-0 byte offset

[Figure: 1024-entry direct-mapped cache with one 32-bit word per
block; the 10-bit index selects an entry, the stored 20-bit tag is
compared against the address tag, and a match with the valid bit set
signals a hit]

Q: What kind of locality are we taking advantage of here? Temporal
locality (one word per block); the next slide adds spatial locality.
Advanced Computer Architecture pg 12


Direct Mapped Cache
• Taking advantage of spatial locality:

Address (bit positions): 31-16 tag (16 bits), 15-4 index (12 bits),
3-2 block offset, 1-0 byte offset

[Figure: 4K-entry direct-mapped cache with 4-word (128-bit) blocks;
the index selects an entry, the tag is compared for a hit, and the
block offset drives a 4-to-1 mux that selects the requested 32-bit
word]
Advanced Computer Architecture pg 13


Cache Basics
• cache_size = Nsets x Assoc x Block_size
• block_address = byte_address DIV block_size_in_bytes
• index = block_address MOD Nsets

• Because the block size and the number of sets are (usually) powers
  of two, DIV and MOD can be performed efficiently

Address layout (bits 31 … 0): tag | index | block offset
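
As a sketch (assuming, for illustration, 64-byte blocks and 256 sets;
these numbers are not from the slide), the DIV/MOD field extraction
reduces to shifts and masks:

  #include <stdint.h>
  #include <stdio.h>

  #define BLOCK_SIZE 64   /* bytes, power of two */
  #define NSETS      256  /* power of two */

  int main(void) {
      uint32_t byte_address = 0x12345678;
      /* DIV and MOD become shifts and masks for powers of two */
      uint32_t block_address = byte_address / BLOCK_SIZE;  /* addr >> 6   */
      uint32_t offset        = byte_address % BLOCK_SIZE;  /* addr & 63   */
      uint32_t index         = block_address % NSETS;      /* block & 255 */
      uint32_t tag           = block_address / NSETS;      /* block >> 8  */
      printf("tag=0x%x index=%u offset=%u\n", tag, index, offset);
      return 0;
  }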

Advanced Computer Architecture pg 14


6 basic cache optimizations
(App. B.3)
• Reduce miss rate
1. Larger block size
2. Bigger cache
3. Associative cache (higher associativity)
   • reduces conflict misses
• Reduce miss penalty
4. Multi-level caches
5. Give priority to read misses over write misses
• Reduce hit time
6. Avoid address translation during indexing of the cache

Advanced Computer Architecture pg 15


Improving Cache Performance
T = Ninstr * CPI * Tcycle
CPI (with cache) = CPI_base + CPI_cachepenalty
CPI_cachepenalty = .............................................

1. Reduce the miss rate
2. Reduce the miss penalty
3. Reduce the time to hit in the cache

Advanced Computer Architecture pg 16


1. Increase Block Size

[Figure: miss rate (0-25%) vs. block size (16-256 bytes) for cache
sizes 1K-256K; larger blocks first reduce the miss rate, but for
small caches the miss rate rises again at large block sizes]
Advanced Computer Architecture pg 17


2. Larger Caches

• Increase capacity of cache

• Disadvantages :
– longer hit time (may determine processor cycle time!!)
– higher cost
– access requires more energy

Advanced Computer Architecture pg 18


3. Use / Increase Associativity

• Direct mapped caches have lots of conflict misses

• Example
  – suppose a direct-mapped cache with 128 entries, 4 words/entry
  – size is 128 x 16 = 2K bytes
  – many addresses map to the same entry, e.g. byte addresses 0-15,
    2K-2K+15, 4K-4K+15, etc. all map to entry 0
  – what if a program repeatedly accesses (in a loop) the following
    3 addresses: 0, 2K+4, and 4K+12?
  – they will all miss, although only 3 words of the cache are
    really used !!

Advanced Computer Architecture pg 19


A 4-Way Set-Associative Cache
[Figure: address bits 31-10 form the 22-bit tag, bits 9-2 the 8-bit
index; the index selects one of 256 sets, the four ways' tags are
compared in parallel, and a 4-to-1 multiplexor selects the data of
the hitting way]

4 ways: a set contains 4 blocks.
A fully associative cache contains 1 set, containing all blocks.
Advanced Computer Architecture pg 20
Example 1: cache calculations
• Assume
– Cache of 4K blocks
– 4 word block size
– 32 bit address
• Direct mapped (associativity = 1):
  – 16 bytes per block = 2^4 : 4 bits for byte offset
  – 32-bit address : 32-4 = 28 bits for index and tag
  – #sets = #blocks/associativity = 4K sets : log2(4K) = 12 bits for index
  – total number of tag bits : (28-12) * 1 * 4K = 64 Kbits
• 2-way associative
  – #sets = #blocks/associativity = 2K sets
  – 1 bit less for indexing, 1 bit more for tag
  – tag bits : (28-11) * 2 * 2K = 68 Kbits
• 4-way associative
  – #sets = #blocks/associativity = 1K sets
  – 1 bit less for indexing, 1 bit more for tag
  – tag bits : (28-10) * 4 * 1K = 72 Kbits
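
A quick sketch that reproduces this arithmetic (the geometry is the
one assumed above: 4K blocks, 16-byte blocks, 32-bit addresses):

  #include <stdio.h>

  int main(void) {
      int blocks = 4096, addr_bits = 32, offset_bits = 4;
      int tag_index_bits = addr_bits - offset_bits;       /* 28 */
      for (int assoc = 1; assoc <= 4; assoc *= 2) {
          int sets = blocks / assoc;
          int index_bits = 0;
          while ((1 << index_bits) < sets) index_bits++;  /* log2(sets) */
          int tag_bits = tag_index_bits - index_bits;
          long total = (long)tag_bits * assoc * sets;
          printf("%d-way: %d sets, %d tag bits, %ld Kbits of tags\n",
                 assoc, sets, tag_bits, total / 1024);
      }
      return 0;   /* prints 64, 68 and 72 Kbits */
  }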
Advanced Computer Architecture pg 21
Example 2: cache mapping
• 3 caches, each consisting of 4 one-word blocks:
  • Cache 1 : fully associative
  • Cache 2 : two-way set associative
  • Cache 3 : direct mapped

• Suppose the following sequence of block addresses:
  0, 8, 0, 6, 8

Advanced Computer Architecture pg 22


Example 2: Direct Mapped
Block address   Cache block
0               0 mod 4 = 0
6               6 mod 4 = 2
8               8 mod 4 = 0

Block address   Hit/miss   Loc 0    Loc 1   Loc 2    Loc 3
0               miss       Mem[0]
8               miss       Mem[8]
0               miss       Mem[0]
6               miss       Mem[0]           Mem[6]
8               miss       Mem[8]           Mem[6]

(each new entry is a miss; all 5 accesses miss)


Advanced Computer Architecture pg 23
Example 2: 2-way Set Associative:
2 sets
Block address   Cache set
0               0 mod 2 = 0
6               6 mod 2 = 0   (so all map to set 0)
8               8 mod 2 = 0

Block address   Hit/miss   Set 0 entry 0   Set 0 entry 1
0               miss       Mem[0]
8               miss       Mem[0]          Mem[8]
0               hit        Mem[0]          Mem[8]
6               miss       Mem[0]          Mem[6]
8               miss       Mem[8]          Mem[6]

(set 1 stays empty; the least recently used block is replaced on a miss)

Advanced Computer Architecture pg 24


Example 2: Fully associative
(4 way assoc., 1 set)

Block address   Hit/miss   Block 0   Block 1   Block 2   Block 3
0               miss       Mem[0]
8               miss       Mem[0]    Mem[8]
0               hit        Mem[0]    Mem[8]
6               miss       Mem[0]    Mem[8]    Mem[6]
8               hit        Mem[0]    Mem[8]    Mem[6]

Advanced Computer Architecture pg 25


Classifying Misses: the 3 Cs

• The 3 Cs:
  – Compulsory: first access to a block is always a miss; also called
    cold start misses
    • the misses you would still get in an infinite cache
  – Capacity: misses resulting from the finite capacity of the cache
    • the misses in a fully associative cache with an optimal
      replacement strategy
  – Conflict: misses occurring because several blocks map to the same
    set; also called collision misses
    • the remaining misses

Advanced Computer Architecture pg 26


3 Cs: Compulsory, Capacity, Conflict
In all cases, assume total cache size is not changed.

What happens if we:
1) Change block size: which of the 3Cs is obviously affected?
   compulsory misses
2) Change cache size: which of the 3Cs is obviously affected?
   capacity misses
3) Introduce higher associativity: which of the 3Cs is obviously
   affected? conflict misses

Advanced Computer Architecture pg 27


3Cs Absolute Miss Rate (SPEC92)

[Figure: absolute miss rate per type vs. cache size (1-128 KB) for
1/2/4/8-way associativity; conflict misses shrink with higher
associativity, capacity misses with larger caches, and compulsory
misses form a small constant floor]

Advanced Computer Architecture pg 28


3Cs Relative Miss Rate

[Figure: relative miss rate per type (0-100%) vs. cache size
(1-128 KB) for 1/2/4/8-way associativity; same data as the previous
slide, normalized]
Advanced Computer Architecture pg 29
Improving Cache Performance
1. Reduce the miss rate
2. Reduce the miss penalty
3. Reduce the time to hit in the cache

Advanced Computer Architecture pg 30


4. Second Level Cache (L2)
• Most CPUs
– have an L1 cache small enough to match the cycle time (reduce
the time to hit the cache)
– have an L2 cache large enough and with sufficient associativity
to capture most memory accesses (reduce miss rate)

• L2 equations, Average Memory Access Time (AMAT):

  AMAT = Hit_time_L1 + Miss_rate_L1 x Miss_penalty_L1
  Miss_penalty_L1 = Hit_time_L2 + Miss_rate_L2 x Miss_penalty_L2

  AMAT = Hit_time_L1 + Miss_rate_L1 x
         (Hit_time_L2 + Miss_rate_L2 x Miss_penalty_L2)

• Definitions:
  – Local miss rate: misses in this cache divided by the total number
    of memory accesses to this cache (Miss_rate_L2)
  – Global miss rate: misses in this cache divided by the total number
    of memory accesses generated by the CPU
    (Miss_rate_L1 x Miss_rate_L2)
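
A minimal sketch of these equations; the numbers echo the worked
example on the next slide (5% L1 misses, 10-cycle L2 access,
100-cycle memory access, 2% global = 40% local L2 miss rate):

  #include <stdio.h>

  double amat(double hit_l1, double mr_l1,
              double hit_l2, double mr_l2, double penalty_l2) {
      double miss_penalty_l1 = hit_l2 + mr_l2 * penalty_l2;
      return hit_l1 + mr_l1 * miss_penalty_l1;
  }

  int main(void) {
      printf("AMAT = %.2f cycles\n", amat(1.0, 0.05, 10.0, 0.40, 100.0));
      return 0;   /* 1 + 0.05 * (10 + 0.40 * 100) = 3.5 cycles */
  }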
Advanced Computer Architecture pg 31
4. Second Level Cache (L2)
• Suppose a processor with base CPI of 1.0
• Clock rate of 500 MHz (2 ns cycle time)
• Main memory access time : 200 ns
• Miss rate per instruction in primary cache : 5%
  What improvement with a second-level cache having 20 ns access
  time, reducing the miss rate to memory to 2%?

• Miss penalty : 200 ns / 2 ns per cycle = 100 clock cycles
• Effective CPI = base CPI + memory stalls per instruction = ?
  – 1 level of cache : total CPI = 1 + 5%*100 = 6
  – 2 levels of cache : a miss in the first-level cache is satisfied
    by the second cache or by memory
    • access second-level cache : 20 ns / 2 ns per cycle = 10 clock cycles
    • if miss in second cache, then access memory : in 2% of the cases
    • total CPI = 1 + primary stalls per instruction + secondary
      stalls per instruction
    • total CPI = 1 + 5%*10 + 2%*100 = 3.5

Machine with L2 cache : 6/3.5 = 1.7 times faster


Advanced Computer Architecture pg 32
4. Second Level Cache

• The global miss rate is similar to the miss rate of a single cache
  (the L2), provided the L2 cache is much bigger than L1.
• The local miss rate is NOT a good measure for secondary caches,
  as it is a function of the L1 cache.
  The global miss rate should be used.
Advanced Computer Architecture pg 33
4. Second Level Cache

Advanced Computer Architecture pg 34


5. Read Priority over Write on Miss
• Write-through with write buffers can cause RAW data hazards:

  SW 512(R0), R3    ; Mem[512] = R3   \
  LW R1, 1024(R0)   ; R1 = Mem[1024]   > 512 and 1024 map to the
  LW R2, 512(R0)    ; R2 = Mem[512]   /  same cache block

• Problem: if a write buffer is used, the final LW may read the
  wrong (old) value from memory !!

• Solution 1 : simply wait for the write buffer to empty
  – increases the read miss penalty (by 50% on the old MIPS 1000)
• Solution 2 : check write buffer contents before the read:
  if no conflicts, let the read continue

Advanced Computer Architecture pg 35


5. Read Priority over Write on Miss
What about write-back?
• Dirty bit: whenever a write is cached, this bit is set (made a 1)
  to tell the cache controller: "when you decide to re-use this cache
  line for a different address, you need to write the current
  contents back to memory"
What if read miss:
• Normal: write the dirty block to memory, then do the read
• Instead: copy the dirty block to a write buffer, then do the read,
  then the write
• Fewer CPU stalls, since the CPU restarts as soon as the read is done
Advanced Computer Architecture pg 36
Improving Cache Performance
1. Reduce the miss rate
2. Reduce the miss penalty
3. Reduce the time to hit in the cache

Advanced Computer Architecture pg 37


6. No address translation during cache access

Advanced Computer Architecture pg 38


11 Advanced Cache Optimizations (2.2)
• Reducing hit time
  1. Small and simple caches
  2. Way prediction
  3. Trace caches
• Increasing cache bandwidth
  4. Pipelined caches
  5. Multibanked caches
  6. Nonblocking caches
• Reducing miss penalty
  7. Critical word first
  8. Merging write buffers
• Reducing miss rate
  9. Compiler optimizations
• Reducing miss penalty or miss rate via parallelism
  10. Hardware prefetching
  11. Compiler prefetching
Advanced Computer Architecture pg 39
1. Small and simple first level caches
• Critical timing path:
–addressing tag memory, then
–comparing tags, then
–selecting correct set

• Direct-mapped caches can overlap tag


compare and transmission of data

• Lower associativity reduces power because


–fewer cache lines are accessed, and
–less complex mux to select the right way

Advanced Computer Architecture pg 40


Recap: 4-Way Set-Associative Cache
[Figure: same 4-way set-associative organization as before: 22-bit
tag, 8-bit index selecting one of 256 sets, parallel tag compare
across the four ways, and a 4-to-1 multiplexor on the data]

Advanced Computer Architecture pg 41


L1 Size and Associativity

Access time vs. size and associativity

Advanced Computer Architecture pg 42


L1 Size and Associativity

Energy per read vs. size and associativity

Advanced Computer Architecture pg 43


2. Fast Hit via Way Prediction
• Make set-associative caches faster
• Keep extra bits in the cache to predict the "way" (block within
  the set) of the next cache access
  – the multiplexor is set early to select the desired block; only
    1 tag comparison is performed
  – on a way miss, the other blocks are checked for matches in the
    next clock cycle
• Accuracy ≈ 85%
• Also saves energy
• Drawback: pipelining the CPU is hard if a hit takes 1 or 2 cycles

[Figure: a predicted-way hit takes the normal hit time; a way
mispredict adds an extra hit time before the miss penalty]
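
A hypothetical sketch of the idea (valid bits and data are omitted
for brevity, and the geometry is an assumption, not from the slide):

  #include <stdint.h>

  #define WAYS 4
  #define SETS 256

  typedef struct {
      uint32_t tag[SETS][WAYS];
      uint8_t  predicted_way[SETS];  /* extra prediction bits per set */
  } Cache;

  /* returns the hitting way, or -1 on a miss */
  int lookup(Cache *c, uint32_t set, uint32_t tag) {
      int w = c->predicted_way[set];
      if (c->tag[set][w] == tag)           /* fast hit: 1 comparison */
          return w;
      for (int i = 0; i < WAYS; i++)       /* way miss: check the rest */
          if (i != w && c->tag[set][i] == tag) {
              c->predicted_way[set] = i;   /* retrain the predictor */
              return i;                    /* costs an extra cycle in HW */
          }
      return -1;                           /* real miss */
  }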

Advanced Computer Architecture pg 44


Way Predicting Instruction Cache
(Alpha 21264-like)
[Figure: the PC plus a predicted way index the primary instruction
cache; the predicted way is supplied by a sequential-way predictor
or a branch-target-way predictor, and the jump/add logic selects the
next PC (PC+4 or the jump target)]

Advanced Computer Architecture pg 45


3. Fast (Inst. Cache) Hit via Trace Cache
Key Idea: Pack multiple non-contiguous basic blocks
into one contiguous trace cache line

[Figure: a dynamic instruction trace crossing three branches (BR) is
packed into one contiguous trace cache line]

• A single fetch brings in multiple basic blocks
• The trace cache is indexed by the start address and the next n
  branch predictions

Advanced Computer Architecture pg 46


3. Fast Hit times via Trace Cache
• Trace cache in Pentium 4 and its successors
  – dynamic instruction traces cached (in the level 1 cache)
  – caches micro-ops rather than x86 instructions
    • decode/translate from x86 to micro-ops on a trace cache miss
+ better utilizes long blocks (don't exit in the middle of a block,
  don't enter at a label in the middle of a block)
- complicated address mapping, since addresses are no longer aligned
  to power-of-2 multiples of the word size
- instructions may appear multiple times in multiple dynamic traces
  due to different branch outcomes
Advanced Computer Architecture pg 47
4. Pipelining Cache
• Pipeline cache access to improve bandwidth
– Examples:
• Pentium: 1 cycle
• Pentium Pro – Pentium III: 2 cycles
• Pentium 4 – Core i7: 4 cycles

• Increases branch mis-prediction penalty


• Makes it easier to increase associativity

Advanced Computer Architecture pg 48


5. Multi-banked Caches
• Organize cache as independent banks to
support simultaneous access
– ARM Cortex-A8 supports 1-4 banks for L2
– Intel i7 supports 4 banks for L1 and 8 banks for
L2

• Interleave banks according to block address

Advanced Computer Architecture pg 49


5. Multi-banked caches
• Banking works best when accesses naturally spread themselves
  across banks; the mapping of addresses to banks affects the
  behavior of the memory system

• A simple mapping that works well is "sequential interleaving"
  – spread block addresses sequentially across banks
  – e.g., with 4 banks:
    • bank 0 has all blocks with address % 4 = 0
    • bank 1 has all blocks with address % 4 = 1; …
      (see the sketch below)
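
A one-line bank selector, as a sketch (4 banks assumed, as in the
example; a power of two, so the modulo is a simple mask):

  #include <stdint.h>
  #include <stdio.h>

  #define NUM_BANKS 4

  int main(void) {
      for (uint32_t block_addr = 0; block_addr < 8; block_addr++)
          printf("block %u -> bank %u\n",
                 block_addr, block_addr % NUM_BANKS);  /* == addr & 3 */
      return 0;
  }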

Advanced Computer Architecture pg 50


6. Nonblocking Caches
• Allow hits before previous misses complete
– “Hit under miss”
– “Hit under multiple miss”
• L2 must support this
• In general, processors can hide L1 miss penalty but not
L2 miss penalty

• Requires OoO processor


• Makes cache control much more complex

Advanced Computer Architecture pg 51


Non-blocking cache

Advanced Computer Architecture pg 52


7. Critical Word First, Early Restart
• Critical word first
  – request the missed word from memory first
  – send it to the processor as soon as it arrives
• Early restart
  – request words in normal order
  – send the missed word to the processor as soon as it arrives

• The effectiveness of these strategies depends on the block size
  and the likelihood of another access to the portion of the block
  that has not yet been fetched

Advanced Computer Architecture pg 53


8. Merging Write Buffer
• When storing to a block that is already pending in the write
  buffer, update the write buffer entry
• Reduces stalls due to a full write buffer
• Does not apply to I/O addresses

[Figure: without merging, four sequential one-word writes occupy
four buffer entries; with merging they combine into one entry]

Advanced Computer Architecture pg 54


9. Compiler Optimizations
• Loop Interchange
– Swap nested loops to access memory in
sequential order

• Blocking
– Instead of accessing entire rows or columns,
subdivide matrices into blocks
– Requires more memory accesses but improves
locality of accesses

Advanced Computer Architecture pg 55


9. Reducing Misses by Compiler Optimizations
• Instructions
– Reorder procedures in memory so as to reduce
conflict misses
– Profiling to look at conflicts (using developed tools)
• Data
– Merging Arrays: improve spatial locality by single
array of compound elements vs. 2 arrays
– Loop Interchange: change nesting of loops to access
data in order stored in memory
– Loop Fusion: combine 2 independent loops that have
same looping and some variables overlap
– Blocking: Improve temporal locality by accessing
“blocks” of data repeatedly vs. going down whole
columns or rows
• Huge miss reductions possible !!
Advanced Computer Architecture pg 56
Merging Arrays
/* before: two separate arrays */
int val[SIZE];
int key[SIZE];

for (i = 0; i < SIZE; i++) {
    key[i] = newkey;
    val[i]++;
}

/* after: one array of compound elements */
struct record {
    int val;
    int key;
};
struct record records[SIZE];

for (i = 0; i < SIZE; i++) {
    records[i].key = newkey;
    records[i].val++;
}

• Reduces conflicts between val & key and improves spatial locality

Advanced Computer Architecture pg 57


Loop Interchange

/* before: column-order traversal strides through array X */
for (col = 0; col < 100; col++)
    for (row = 0; row < 5000; row++)
        X[row][col] = X[row][col+1];

/* after: row-order traversal matches the storage order */
for (row = 0; row < 5000; row++)
    for (col = 0; col < 100; col++)
        X[row][col] = X[row][col+1];

• Sequential accesses instead of striding through memory every
  100 words
• Improves spatial locality
Advanced Computer Architecture pg 58
Loop Fusion
/* before: two separate loop nests */
for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
        a[i][j] = 1/b[i][j] * c[i][j];
for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
        d[i][j] = a[i][j] + c[i][j];

/* after: fused loops; the second reference to a[i][j] can come
   directly from a register */
for (i = 0; i < N; i++)
    for (j = 0; j < N; j++) {
        a[i][j] = 1/b[i][j] * c[i][j];
        d[i][j] = a[i][j] + c[i][j];
    }

Split loops: every access to a and c misses. Fused loops: only the
1st access misses. Improves temporal locality.
Advanced Computer Architecture pg 59
Blocking (Tiling) applied to array
multiplication
for (i = 0; i < N; i++)
    for (j = 0; j < N; j++) {
        c[i][j] = 0.0;
        for (k = 0; k < N; k++)
            c[i][j] += a[i][k] * b[k][j];
    }

[Figure: c = a x b]

• The two inner loops:
  – read all NxN elements of b
  – read all N elements of one row of a repeatedly
  – write all N elements of one row of c
• If a whole matrix does not fit in the cache, many cache misses
  result
• Idea: compute on a BxB submatrix that fits in the cache
Advanced Computer Architecture pg 60
Blocking Example

#define min(a, b) ((a) < (b) ? (a) : (b))

for (ii = 0; ii < N; ii += B)
    for (jj = 0; jj < N; jj += B)
        for (i = ii; i < min(ii+B, N); i++)
            for (j = jj; j < min(jj+B, N); j++) {
                c[i][j] = 0.0;
                for (k = 0; k < N; k++)
                    c[i][j] += a[i][k] * b[k][j];
            }

[Figure: c = a x b, computed block by block]

• B is called the blocking factor
• Can reduce capacity misses from 2N³ + N² to 2N³/B + N²

Advanced Computer Architecture pg 61


Reducing Conflict Misses by Blocking
• Conflict misses in caches vs. blocking size
  – Lam et al [1991]: a blocking factor of 24 had a fifth the misses
    of a factor of 48, despite both fitting in the cache

[Figure: miss rate (0-0.15) vs. blocking factor (0-150) for a
direct-mapped cache vs. a fully associative cache]

Advanced Computer Architecture pg 62
Summary of Compiler Optimizations to
Reduce Cache Misses (by hand)

[Figure: performance improvement (1x to 3x) from the four
optimizations (merged arrays, loop interchange, loop fusion,
blocking) on compress, cholesky (nasa7), spice, mxm (nasa7),
btrix (nasa7), tomcatv, gmty (nasa7) and vpenta (nasa7)]

Advanced Computer Architecture pg 63


10. Hardware Data Prefetching
• Prefetch-on-miss:
  – prefetch block (b + 1) upon a miss on b

• One Block Lookahead (OBL) scheme
  – initiate a prefetch for block (b + 1) when block b is accessed
  – why is this different from doubling the block size?
  – can be extended to N-block lookahead

• Strided prefetch
  – if the observed sequence of accesses is block b, b+N, b+2N, then
    prefetch b+3N, etc. (see the sketch below)

• Example: IBM Power 5 [2003] supports eight independent streams of
  strided prefetch per processor, prefetching 12 lines ahead of the
  current access

• Note: instructions are usually prefetched into an instruction buffer
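
A hypothetical sketch of stride detection (the structure and names
are illustrative, not a real prefetcher):

  #include <stdint.h>
  #include <stdio.h>

  typedef struct {
      int64_t last_block;
      int64_t stride;
      int     confirmed;   /* same stride seen twice in a row */
  } StridePredictor;

  void access_block(StridePredictor *p, int64_t block) {
      int64_t new_stride = block - p->last_block;
      p->confirmed  = (new_stride == p->stride);
      p->stride     = new_stride;
      p->last_block = block;
      if (p->confirmed)    /* pattern b, b+N, b+2N detected */
          printf("access %lld -> prefetch %lld\n",
                 (long long)block, (long long)(block + p->stride));
  }

  int main(void) {
      StridePredictor p = {0, 0, 0};
      int64_t seq[] = {10, 14, 18, 22};   /* stride N = 4 */
      for (int i = 0; i < 4; i++)
          access_block(&p, seq[i]);       /* prefetches 22, then 26 */
      return 0;
  }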


Advanced Computer Architecture pg 64
10. Hardware Prefetching
• Fetch two blocks on miss (include next
sequential block)

Pentium 4 Pre-fetching
Advanced Computer Architecture pg 65
Issues in HW Prefetching
• Usefulness – should produce hits
  – if you are unlucky, the prefetched data/instructions are not needed
• Timeliness – not too late and not too early
• Cache and bandwidth pollution

[Figure: prefetched data flows from the unified L2 into the L1
instruction and data caches alongside demand fetches]

Advanced Computer Architecture pg 66


Issues in HW prefetching: stream buffer
• Instruction prefetch in Alpha AXP 21064
– Fetch two blocks on a miss; the requested block (i)
and the next consecutive block (i+1)
– Requested block placed in cache, and next block in
instruction stream buffer
– If miss in cache but hit in stream buffer, move
stream buffer block into cache and prefetch next
block (i+2)
[Figure: on a miss, the requested block goes from the unified L2 into
the L1 instruction cache, while the prefetched next block goes into
the stream buffer]
Advanced Computer Architecture pg 67
11. Compiler Prefetching
• Insert prefetch instructions before data is needed
• Non-faulting: prefetch doesn’t cause exceptions

• Register prefetch
– Loads data into register
• Cache prefetch
– Loads data into cache

• Combine with loop unrolling and software pipelining

• Cost of prefetching: more bandwidth (speculation) !!
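
A sketch of cache prefetching with GCC/Clang's __builtin_prefetch
(the prefetch distance of 16 elements is an illustrative assumption):

  #include <stddef.h>

  void scale(double *a, size_t n, double k) {
      for (size_t i = 0; i < n; i++) {
          if (i + 16 < n)   /* non-faulting hint: read, high locality */
              __builtin_prefetch(&a[i + 16], 0, 3);
          a[i] *= k;
      }
  }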

Advanced Computer Architecture pg 68


Technique (+ = improves, – = hurts; complexity from 0 = trivial to 3 = high)

Small and simple caches             + hit time, – miss rate, complexity 0
                                    Trivial; widely used
Way-predicting caches               + hit time, complexity 1
                                    Used in Pentium 4
Trace caches                        + hit time, complexity 3
                                    Used in Pentium 4
Pipelined cache access              – hit time, + bandwidth, complexity 1
                                    Widely used
Nonblocking caches                  + bandwidth, + miss penalty, complexity 3
                                    Widely used
Banked caches                       + bandwidth, complexity 1
                                    Used in L2 of Opteron and Niagara
Critical word first, early restart  + miss penalty, complexity 2
                                    Widely used
Merging write buffer                + miss penalty, complexity 1
                                    Widely used with write through
Compiler techniques to reduce
cache misses                        + miss rate, complexity 0
                                    Software is a challenge; some computers
                                    have a compiler option
Hardware prefetching of
instructions and data               + miss penalty, + miss rate,
                                    complexity 2 (instr.) / 3 (data)
                                    Many prefetch instructions;
                                    AMD Opteron prefetches data
Compiler-controlled prefetching     + miss penalty, + miss rate, complexity 3
                                    Needs a nonblocking cache; in many CPUs

Advanced Computer Architecture pg 69


Memory Technology
• Performance metrics
  – latency is the concern of the cache
  – bandwidth is the concern of multiprocessors and I/O
  – access time: time between a read request and when the desired
    word arrives
  – cycle time: minimum time between unrelated requests to memory

• DRAM is used for main memory, SRAM for caches

Advanced Computer Architecture pg 70


Memory Technology
• SRAM
– Requires low power to retain bit
– Requires 6 transistors/bit

• DRAM
– Must be re-written after being read
– Must also be periodically refreshed
  • every ~8 ms
  • an entire row is refreshed at once
– One transistor/bit
– Address lines are multiplexed:
• Upper half of address: row access strobe (RAS)
• Lower half of address: column access strobe (CAS)
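
A sketch of this address multiplexing (the 14-bit row / 10-bit column
split is hypothetical; real parts differ):

  #include <stdint.h>
  #include <stdio.h>

  #define ROW_BITS 14
  #define COL_BITS 10

  int main(void) {
      uint32_t dram_addr = 0x00ABCDE & ((1u << (ROW_BITS + COL_BITS)) - 1);
      uint32_t row = dram_addr >> COL_BITS;              /* sent with RAS */
      uint32_t col = dram_addr & ((1u << COL_BITS) - 1); /* sent with CAS */
      printf("row=0x%x col=0x%x\n", row, col);
      return 0;
  }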

Advanced Computer Architecture pg 71


Memory Technology
• Amdahl:
  – memory capacity should grow linearly with processor speed
  – unfortunately, memory capacity and speed have not kept pace
    with processors

• Some optimizations:
– Multiple accesses to same row
– Synchronous DRAM
• Added clock to DRAM interface
• Burst mode with critical word first
– Wider interfaces
– Double data rate (DDR)
– Multiple banks on each DRAM device

Advanced Computer Architecture pg 72


SRAM vs DRAM
Static RAM (SRAM)
► Bitlines driven by transistors
  – fast (~10x faster than DRAM)
► 6 transistors per bit
  – large cell (~6-10x the DRAM cell area)

Dynamic RAM (DRAM)
► A bit is stored as charge on a capacitor
► 1 transistor and 1 capacitor per bit
► The bit cell loses charge over time (read operation and circuit
  leakage)
  – must be periodically refreshed
  – hence the name Dynamic RAM

Credits: J.Leverich, Stanford Advanced Computer Architecture pg 73


DRAM: Internal architecture

[Figure: DRAM internals: address register, row decoder, memory array
per bank (banks 1-4), sense amplifiers (row buffer), column decoder,
data pins]

• Bit cells are arranged to form a memory array
• Multiple arrays are organized as different banks
  – typical numbers of banks are 4, 8 and 16
• The MS address bits select the row, the LS bits the column
• Sense amplifiers raise the voltage level on the bitlines to read
  the data out

Credits: J.Leverich, Stanford Advanced Computer Architecture pg 74


Memory Optimizations

Advanced Computer Architecture pg 75


Memory Optimizations

Advanced Computer Architecture pg 76


Memory Optimizations
• DDR:
– DDR2
• Lower power (2.5 V -> 1.8 V)
• Higher clock rates (266 MHz, 333 MHz, 400 MHz)
– DDR3
• 1.5 V
• 800 MHz
– DDR4
• 1-1.2 V
• 1600 MHz

• GDDR5 is graphics memory based on DDR3

Advanced Computer Architecture pg 77


Memory Optimizations
• Graphics memory:
  – achieves 2-5x bandwidth per DRAM vs. DDR3
    • wider interfaces (32 vs. 16 bit)
    • higher clock rate
  – possible because they are attached via soldering instead of
    socketed DIMM modules

• Reducing power in SDRAMs:
  – lower voltage
  – low power mode (ignores clock, continues to refresh)

Advanced Computer Architecture pg 78


Memory Power Consumption

Advanced Computer Architecture pg 79


Flash Memory
• Type of EEPROM
–(Electrical Erasable Programmable Read Only
Memory)
• Must be erased (in blocks) before being
overwritten
• Non-volatile
• Limited number of write cycles
• Cheaper than SDRAM, more expensive than disk
• Slower than SDRAM, faster than disk

Advanced Computer Architecture pg 80


Memory Dependability
• Memory is susceptible to cosmic rays
• Soft errors: dynamic errors
– Detected and fixed by error correcting codes (ECC)
• Hard errors: permanent errors
  – use spare rows to replace defective rows

• Chipkill: a RAID-like error recovery technique

Advanced Computer Architecture pg 81


Virtual Memory
• Protection via virtual memory
– Keeps processes in their own memory
space

• Role of architecture:
– Provide user mode and supervisor mode
– Protect certain aspects of CPU state
– Provide mechanisms for switching
between user and supervisor mode
– Provide mechanisms to limit memory
accesses
• read-only pages
• executable pages
• shared pages
– Provide TLB to translate addresses
Advanced Computer Architecture pg 82
Memory organization
• The operating system, together with the MMU hardware, takes care
  of separating the programs.
• Each program runs in its own 'virtual' environment, and uses
  logical addressing that is (often) different from the actual
  physical addresses.

• Within the virtual world of a program, the full 4 gigabyte address
  space is available. (Less under Windows.)

• In the von Neumann architecture, we need to manage the memory
  space to store the following:
  – the machine code of the program
  – the data:
    • global variables and constants
    • the stack / local variables
    • the heap

[Figure: main memory holding program + data]

Advanced Computer Architecture pg 83


Memory Organization: more detail
[Figure: memory map from address 0x00000000 (bottom) to 0xFFFFFFFF
(top)]

– Machine code (fixed size): the program itself, a set of machine
  instructions; this is in the .exe
– Global variables and constants (fixed size): initialized before
  the first line of the program is run
– Stack (variable size): the local variables in the routines; with
  each routine call, a new set of variables is put on the stack
  (stack pointer)
– Free memory
– Heap (variable size): the memory that is reserved by the memory
  manager

If the heap and the stack collide, we're out of memory.
Advanced Computer Architecture pg 84
Memory management
• Problem: many programs run simultaneously
• The MMU (Memory Management Unit) manages memory access:
  – each program thinks it owns all the memory; the CPU issues
    logical (virtual) addresses
  – the process table checks whether the requested address is
    'in core':
    • yes: translate to the physical address and access main
      memory / cache
    • no: either an access violation, or first load the 2K block
      from the swap file on the hard disk

[Figure: CPU -> MMU (process table) -> main memory organized in 2K
blocks, with a swap file on the hard disk for blocks not in memory]

Advanced Computer Architecture pg 85


Virtual Memory
• Main memory can act as a cache for the secondary storage (disk)

[Figure: virtual addresses map via address translation to physical
addresses in memory, or to disk addresses]

Advantages:
 illusion of having more physical memory
 program relocation
 protection
Advanced Computer Architecture pg 86
Advanced Computer Architecture pg 86
Pages: virtual memory blocks
• Page faults: the data is not in memory, so retrieve it from disk
  – huge miss penalty, thus pages should be fairly large (e.g., 4 KB)
  – reducing page faults is important (LRU is worth the price)
  – the faults can be handled in software instead of hardware
  – write-through is too expensive, so we use write-back

Virtual address: virtual page number (bits 31-12) + page offset
(bits 11-0); translation maps it to a physical address: physical
page number (bits 29-12) + the unchanged page offset
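
A sketch of this split for 4 KB pages (the example address and the
mapping to physical page 0x12345 are made up for illustration):

  #include <stdint.h>
  #include <stdio.h>

  #define PAGE_SHIFT 12                       /* 4 KB pages */
  #define PAGE_MASK  ((1u << PAGE_SHIFT) - 1)

  int main(void) {
      uint32_t vaddr  = 0x0040321C;
      uint32_t vpn    = vaddr >> PAGE_SHIFT;   /* virtual page number */
      uint32_t offset = vaddr & PAGE_MASK;     /* unchanged by translation */

      uint32_t ppn    = 0x12345;               /* from the page table */
      uint32_t paddr  = (ppn << PAGE_SHIFT) | offset;

      printf("vpn=0x%x offset=0x%x paddr=0x%x\n", vpn, offset, paddr);
      return 0;
  }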

Advanced Computer Architecture pg 87


Page Tables
[Figure: the page table is indexed by the virtual page number; a
valid entry (valid bit = 1) holds a physical page number, an invalid
entry (valid bit = 0) refers to the page's address in disk storage]

Advanced Computer Architecture pg 88


Page Tables
[Figure: the page table register points to the page table in memory.
The 20-bit virtual page number (bits 31-12) indexes the table; the
12-bit page offset passes through. A valid entry supplies the 18-bit
physical page number, which is concatenated with the page offset to
form the physical address; valid = 0 means the page is not present
in memory]

Advanced Computer Architecture pg 89


Size of page table
• Assume
– 40-bit virtual address; 32-bit physical
– 4 Kbyte pages; 4 bytes per page table entry (PTE)


Solution
 Size = Nentries * size of entry = 2^40 / 2^12 * 4 bytes
   = 2^28 * 4 bytes = 1 GByte


Reduce size:
 Dynamic allocation of page table entries
 Hashing: inverted page table
 1 entry per physical available instead of virtual page
 Page the page table itself (i.e. part of it can be on disk)
 Use larger page size (multiple page sizes)
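
The arithmetic above, as a one-screen sketch:

  #include <stdio.h>

  int main(void) {
      /* 40-bit virtual address, 4 KB pages, 4-byte PTEs */
      unsigned long long entries = 1ULL << (40 - 12);  /* 2^28 pages */
      unsigned long long bytes   = entries * 4;        /* 2^30 bytes */
      printf("%llu entries, %llu GByte\n", entries, bytes >> 30);
      return 0;
  }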

Advanced Computer Architecture pg 90


Fast Translation Using a TLB
• Address translation would appear to require extra memory references
  – one to access the PTE (page table entry)
  – then the actual memory access

• However, access to page tables has good locality
  – so use a fast cache of PTEs within the CPU
  – called a Translation Look-aside Buffer (TLB)
  – typical: 16-512 PTEs, 0.5-1 cycle for a hit, 10-100 cycles for a
    miss, 0.01%-1% miss rate
  – misses can be handled by hardware or software

Advanced Computer Architecture pg 91


Making Address Translation Fast
• A cache for address translations: the translation lookaside
  buffer (TLB)

[Figure: the TLB caches page table entries (valid bit, tag = virtual
page number, physical page address); on a TLB miss the full page
table is consulted, whose valid entries point to physical memory and
invalid entries to disk storage]

Advanced Computer Architecture pg 92


TLBs and caches
[Flowchart: virtual address -> TLB access.
 - TLB miss: TLB miss exception.
 - TLB hit: physical address.
   - Read: try to read the data from the cache; on a cache miss,
     stall; on a cache hit, deliver the data to the CPU.
   - Write: check the write access bit; if off, write protection
     exception; if on, write the data into the cache, update the tag,
     and put the data and the address into the write buffer]

Advanced Computer Architecture pg 93


Overall operation of memory hierarchy
• Each instruction or data access can result in three types of
  hits/misses: TLB, page table, cache
• Q: which combinations are possible?
  Check them all! (see fig 5.26)

TLB    Page table   Cache   Possible?
hit    hit          hit     yes: that's what we want
hit    hit          miss    yes, but the page table is not checked
                            on a TLB hit
hit    miss         hit     no
hit    miss         miss    no
miss   hit          hit
miss   hit          miss
miss   miss         hit     no
miss   miss         miss
Advanced Computer Architecture pg 94
ARM Cortex-A8 data caches/TLB.
Since the instruction and data hierarchies are symmetric, we show
only one. The TLB (instruction or data) is fully associative with
32 entries. The L1 cache is four-way set associative with 64-byte
blocks and 32 KB capacity. The L2 cache is eight-way set associative
with 64-byte blocks and 1 MB capacity. This figure doesn't show the
valid bits and protection bits for the caches and TLB, nor the use
of the way prediction bits that would dictate the predicted bank of
the L1 cache.
Advanced Computer Architecture pg 95
Intel Nehalem (i7)
• Die: 13.5 x 19.6 mm, 731 Mtransistors
• Per core:
  – 32-KB instruction & 32-KB data L1 caches
  – 512 KB L2
  – 2-level TLB
• Shared:
  – 8 MB L3
  – 2 x 128-bit DDR3 channels

Advanced Computer Architecture pg 96


The Intel i7 memory hierarchy
The steps in both instruction and data access. We show only reads
for data. Writes are similar, in that they begin with a read (since
caches are write back). Misses are handled by simply placing the
data in a write buffer, since the L1 cache is not write allocated.

Advanced Computer Architecture pg 97


Address translation and TLBs

Advanced Computer Architecture pg 98


Cache L1-L2-L3 organization

Advanced Computer Architecture pg 99


Virtual Machines
• Supports isolation and security
• Sharing a computer among many unrelated users
• Enabled by the raw speed of processors, which makes the overhead
  more acceptable

• Allows different operating systems to be presented to user programs
  – "System Virtual Machines"
  – the SVM software is called "virtual machine monitor" or
    "hypervisor"
  – individual virtual machines running under the monitor are called
    "guest VMs"

Advanced Computer Architecture pg 100


Impact of VMs on Virtual Memory
• Each guest OS maintains its own set of page
tables
– VMM adds a level of memory between physical and
virtual memory called “real memory”
– VMM maintains shadow page table that maps guest
virtual addresses to physical addresses
• Requires VMM to detect guest’s changes to its own page
table
• Occurs naturally if accessing the page table pointer is a
privileged operation

Advanced Computer Architecture pg 101
