
Advanced Computer

Architecture
Memory Hierarchy Design

Course 5MD00

Henk Corporaal
November 2013
[email protected]

Advanced Computer Architecture pg 1


Welcome!
This lecture:
• Memory Hierarchy Design
– Hierarchy
– Recap of Caching (App B)
– Many Cache and Memory Hierarchy Optimizations
– VM: virtual memory support
– ARM Cortex-A8 and Intel Core i7 examples

• Material: book of Hennessy & Patterson
  – Appendix B
  – plus Chapter 2: sections 2.1-2.6

Advanced Computer Architecture pg 2


Registers vs. Memory
• Operands of arithmetic instructions must be registers;
  only 32 registers are provided (why?)
• Compiler associates variables with registers
• Question: what to do about programs with lots of variables?

[Figure: three-level hierarchy]
– CPU register file: 32 x 4 = 128 bytes, fast (2000 MHz)
– Cache: 1 MB, slower (500 MHz)
– Main memory: 4 GB, slowest (133 MHz)
Advanced Computer Architecture pg 3
Memory Hierarchy

Advanced Computer Architecture pg 4


Why does a small cache still work?
• LOCALITY
  – Temporal: you are likely to access the same address again soon
  – Spatial: you are likely to access another address close to the
    current one in the near future

Advanced Computer Architecture pg 5


Memory Performance Gap

Advanced Computer Architecture pg 6


Memory Hierarchy Design
• Memory hierarchy design becomes more crucial
with recent multi-core processors:
– Aggregate peak bandwidth grows with # cores:
  • Intel Core i7 can generate two data references per core per clock
  • With four cores and a 3.2 GHz clock:
    – 25.6 billion 64-bit data references/second +
    – 12.8 billion 128-bit instruction references/second
    – = 409.6 GB/s!
  • DRAM bandwidth is only 6% of this (25 GB/s)
– Requires:
• Multi-port, pipelined caches
• Two levels of cache per core
• Shared third-level cache on chip

Advanced Computer Architecture pg 7


Memory Hierarchy Basics

• Note that speculative and multithreaded processors may execute
  other instructions during a miss
  – Reduces performance impact of misses

Advanced Computer Architecture pg 8


Cache operation
[Figure: the cache (higher level) holds blocks/lines copied from
memory (lower level); each cache line stores a tag plus data]

Advanced Computer Architecture pg 9


Direct Mapped Cache
• Mapping: address is modulo the number of blocks in the cache

[Figure: 8-entry direct-mapped cache (entries 000-111); memory
addresses 00001, 01001, 10001, 11001 all map to entry 001, addresses
00101, 01101, 10101, 11101 to entry 101, and so on]
Advanced Computer Architecture pg 10


Review: Four Questions for Memory
Hierarchy Designers
• Q1: Where can a block be placed in the upper
level? (Block placement)
– Fully Associative, Set Associative, Direct Mapped
• Q2: How is a block found if it is in the upper
level?
(Block identification)
– Tag/Block
• Q3: Which block should be replaced on a miss?
(Block replacement)
– Random, FIFO, LRU
• Q4: What happens on a write?
(Write strategy)
– Write Back or Write Through (with Write Buffer)
Advanced Computer Architecture pg 11
Direct Mapped Cache
Address (bit positions): 31-12 tag (20 bits), 11-2 index (10 bits),
1-0 byte offset

[Figure: 1024-entry direct-mapped cache with one 32-bit word per
block; the 10-bit index selects an entry, the stored 20-bit tag is
compared against the address tag, and a match with the valid bit set
signals a hit]

Q: What kind of locality are we taking advantage of here? Temporal
locality (one word per block); the next slide adds spatial locality.
Advanced Computer Architecture pg 12


Direct Mapped Cache
• Taking advantage of spatial locality:

Address (bit positions): 31-16 tag (16 bits), 15-4 index (12 bits),
3-2 block offset, 1-0 byte offset

[Figure: 4K-entry direct-mapped cache with 4-word (128-bit) blocks;
the index selects an entry, the tag is compared for a hit, and the
block offset drives a 4-to-1 mux that selects the requested 32-bit
word]
Advanced Computer Architecture pg 13


Cache Basics
• cache_size = Nsets x Assoc x Block_size
• block_address = byte_address DIV block_size_in_bytes
• index = block_address MOD Nsets

• Because the block size and the number of sets are (usually) powers
  of two, DIV and MOD can be performed efficiently

Address layout (bits 31 … 0): tag | index | block offset
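
As a sketch (assuming, for illustration, 64-byte blocks and 256 sets;
these numbers are not from the slide), the DIV/MOD field extraction
reduces to shifts and masks:

  #include <stdint.h>
  #include <stdio.h>

  #define BLOCK_SIZE 64   /* bytes, power of two */
  #define NSETS      256  /* power of two */

  int main(void) {
      uint32_t byte_address = 0x12345678;
      /* DIV and MOD become shifts and masks for powers of two */
      uint32_t block_address = byte_address / BLOCK_SIZE;  /* addr >> 6   */
      uint32_t offset        = byte_address % BLOCK_SIZE;  /* addr & 63   */
      uint32_t index         = block_address % NSETS;      /* block & 255 */
      uint32_t tag           = block_address / NSETS;      /* block >> 8  */
      printf("tag=0x%x index=%u offset=%u\n", tag, index, offset);
      return 0;
  }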

Advanced Computer Architecture pg 14


6 basic cache optimizations
(App. B.3)
• Reduce miss rate
1. Larger block size
2. Bigger cache
3. Associative cache (higher associativity)
   • reduces conflict misses
• Reduce miss penalty
4. Multi-level caches
5. Give priority to read misses over write misses
• Reduce hit time
6. Avoid address translation during indexing of the cache

Advanced Computer Architecture pg 15


Improving Cache Performance
T = Ninstr * CPI * Tcycle
CPI (with cache) = CPI_base + CPI_cachepenalty
CPI_cachepenalty = .............................................

1. Reduce the miss rate
2. Reduce the miss penalty
3. Reduce the time to hit in the cache

Advanced Computer Architecture pg 16


1. Increase Block Size

[Figure: miss rate (0-25%) vs. block size (16-256 bytes) for cache
sizes 1K-256K; larger blocks first reduce the miss rate, but for
small caches the miss rate rises again at large block sizes]
Advanced Computer Architecture pg 17


2. Larger Caches

• Increase capacity of cache

• Disadvantages :
– longer hit time (may determine processor cycle time!!)
– higher cost
– access requires more energy

Advanced Computer Architecture pg 18


3. Use / Increase Associativity

• Direct mapped caches have lots of conflict misses

• Example
  – suppose a direct-mapped cache with 128 entries, 4 words/entry
  – size is 128 x 16 = 2K bytes
  – many addresses map to the same entry, e.g. byte addresses 0-15,
    2K-2K+15, 4K-4K+15, etc. all map to entry 0
  – what if a program repeatedly accesses (in a loop) the following
    3 addresses: 0, 2K+4, and 4K+12?
  – they will all miss, although only 3 words of the cache are
    really used !!

Advanced Computer Architecture pg 19


A 4-Way Set-Associative Cache
[Figure: address bits 31-10 form the 22-bit tag, bits 9-2 the 8-bit
index; the index selects one of 256 sets, the four ways' tags are
compared in parallel, and a 4-to-1 multiplexor selects the data of
the hitting way]

4 ways: a set contains 4 blocks.
A fully associative cache contains 1 set, containing all blocks.
Advanced Computer Architecture pg 20
Example 1: cache calculations
• Assume
– Cache of 4K blocks
– 4 word block size
– 32 bit address
• Direct mapped (associativity = 1):
  – 16 bytes per block = 2^4 : 4 bits for byte offset
  – 32-bit address : 32-4 = 28 bits for index and tag
  – #sets = #blocks/associativity = 4K sets : log2(4K) = 12 bits for index
  – total number of tag bits : (28-12) * 1 * 4K = 64 Kbits
• 2-way associative
  – #sets = #blocks/associativity = 2K sets
  – 1 bit less for indexing, 1 bit more for tag
  – tag bits : (28-11) * 2 * 2K = 68 Kbits
• 4-way associative
  – #sets = #blocks/associativity = 1K sets
  – 1 bit less for indexing, 1 bit more for tag
  – tag bits : (28-10) * 4 * 1K = 72 Kbits
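
A quick sketch that reproduces this arithmetic (the geometry is the
one assumed above: 4K blocks, 16-byte blocks, 32-bit addresses):

  #include <stdio.h>

  int main(void) {
      int blocks = 4096, addr_bits = 32, offset_bits = 4;
      int tag_index_bits = addr_bits - offset_bits;       /* 28 */
      for (int assoc = 1; assoc <= 4; assoc *= 2) {
          int sets = blocks / assoc;
          int index_bits = 0;
          while ((1 << index_bits) < sets) index_bits++;  /* log2(sets) */
          int tag_bits = tag_index_bits - index_bits;
          long total = (long)tag_bits * assoc * sets;
          printf("%d-way: %d sets, %d tag bits, %ld Kbits of tags\n",
                 assoc, sets, tag_bits, total / 1024);
      }
      return 0;   /* prints 64, 68 and 72 Kbits */
  }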
Advanced Computer Architecture pg 21
Example 2: cache mapping
• 3 caches, each consisting of 4 one-word blocks:
  • Cache 1 : fully associative
  • Cache 2 : two-way set associative
  • Cache 3 : direct mapped

• Suppose the following sequence of block addresses:
  0, 8, 0, 6, 8

Advanced Computer Architecture pg 22


Example 2: Direct Mapped
Block address   Cache block
0               0 mod 4 = 0
6               6 mod 4 = 2
8               8 mod 4 = 0

Block address   Hit/miss   Loc 0    Loc 1   Loc 2    Loc 3
0               miss       Mem[0]
8               miss       Mem[8]
0               miss       Mem[0]
6               miss       Mem[0]           Mem[6]
8               miss       Mem[8]           Mem[6]

(each new entry is a miss; all 5 accesses miss)


Advanced Computer Architecture pg 23
Example 2: 2-way Set Associative:
2 sets
Block address   Cache set
0               0 mod 2 = 0
6               6 mod 2 = 0   (so all map to set 0)
8               8 mod 2 = 0

Block address   Hit/miss   Set 0 entry 0   Set 0 entry 1
0               miss       Mem[0]
8               miss       Mem[0]          Mem[8]
0               hit        Mem[0]          Mem[8]
6               miss       Mem[0]          Mem[6]
8               miss       Mem[8]          Mem[6]

(set 1 stays empty; the least recently used block is replaced on a miss)

Advanced Computer Architecture pg 24


Example 2: Fully associative
(4 way assoc., 1 set)

Block address   Hit/miss   Block 0   Block 1   Block 2   Block 3
0               miss       Mem[0]
8               miss       Mem[0]    Mem[8]
0               hit        Mem[0]    Mem[8]
6               miss       Mem[0]    Mem[8]    Mem[6]
8               hit        Mem[0]    Mem[8]    Mem[6]

Advanced Computer Architecture pg 25


Classifying Misses: the 3 Cs

• The 3 Cs:
  – Compulsory: first access to a block is always a miss; also called
    cold start misses
    • the misses you would still get in an infinite cache
  – Capacity: misses resulting from the finite capacity of the cache
    • the misses in a fully associative cache with an optimal
      replacement strategy
  – Conflict: misses occurring because several blocks map to the same
    set; also called collision misses
    • the remaining misses

Advanced Computer Architecture pg 26


3 Cs: Compulsory, Capacity, Conflict
In all cases, assume total cache size is not changed.

What happens if we:
1) Change block size: which of the 3Cs is obviously affected?
   compulsory misses
2) Change cache size: which of the 3Cs is obviously affected?
   capacity misses
3) Introduce higher associativity: which of the 3Cs is obviously
   affected? conflict misses

Advanced Computer Architecture pg 27


3Cs Absolute Miss Rate (SPEC92)

[Figure: absolute miss rate per type vs. cache size (1-128 KB) for
1/2/4/8-way associativity; conflict misses shrink with higher
associativity, capacity misses with larger caches, and compulsory
misses form a small constant floor]

Advanced Computer Architecture pg 28


3Cs Relative Miss Rate

[Figure: relative miss rate per type (0-100%) vs. cache size
(1-128 KB) for 1/2/4/8-way associativity; same data as the previous
slide, normalized]
Advanced Computer Architecture pg 29
Improving Cache Performance
1. Reduce the miss rate
2. Reduce the miss penalty
3. Reduce the time to hit in the cache

Advanced Computer Architecture pg 30


4. Second Level Cache (L2)
• Most CPUs
– have an L1 cache small enough to match the cycle time (reduce
the time to hit the cache)
– have an L2 cache large enough and with sufficient associativity
to capture most memory accesses (reduce miss rate)

• L2 equations, Average Memory Access Time (AMAT):

  AMAT = Hit_time_L1 + Miss_rate_L1 x Miss_penalty_L1
  Miss_penalty_L1 = Hit_time_L2 + Miss_rate_L2 x Miss_penalty_L2

  AMAT = Hit_time_L1 + Miss_rate_L1 x
         (Hit_time_L2 + Miss_rate_L2 x Miss_penalty_L2)

• Definitions:
  – Local miss rate: misses in this cache divided by the total number
    of memory accesses to this cache (Miss_rate_L2)
  – Global miss rate: misses in this cache divided by the total number
    of memory accesses generated by the CPU
    (Miss_rate_L1 x Miss_rate_L2)
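
A minimal sketch of these equations; the numbers echo the worked
example on the next slide (5% L1 misses, 10-cycle L2 access,
100-cycle memory access, 2% global = 40% local L2 miss rate):

  #include <stdio.h>

  double amat(double hit_l1, double mr_l1,
              double hit_l2, double mr_l2, double penalty_l2) {
      double miss_penalty_l1 = hit_l2 + mr_l2 * penalty_l2;
      return hit_l1 + mr_l1 * miss_penalty_l1;
  }

  int main(void) {
      printf("AMAT = %.2f cycles\n", amat(1.0, 0.05, 10.0, 0.40, 100.0));
      return 0;   /* 1 + 0.05 * (10 + 0.40 * 100) = 3.5 cycles */
  }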
Advanced Computer Architecture pg 31
4. Second Level Cache (L2)
• Suppose a processor with base CPI of 1.0
• Clock rate of 500 MHz (2 ns cycle time)
• Main memory access time : 200 ns
• Miss rate per instruction in primary cache : 5%
  What improvement with a second-level cache having 20 ns access
  time, reducing the miss rate to memory to 2%?

• Miss penalty : 200 ns / 2 ns per cycle = 100 clock cycles
• Effective CPI = base CPI + memory stalls per instruction = ?
  – 1 level of cache : total CPI = 1 + 5%*100 = 6
  – 2 levels of cache : a miss in the first-level cache is satisfied
    by the second cache or by memory
    • access second-level cache : 20 ns / 2 ns per cycle = 10 clock cycles
    • if miss in second cache, then access memory : in 2% of the cases
    • total CPI = 1 + primary stalls per instruction + secondary
      stalls per instruction
    • total CPI = 1 + 5%*10 + 2%*100 = 3.5

Machine with L2 cache : 6/3.5 = 1.7 times faster


Advanced Computer Architecture pg 32
4. Second Level Cache

• The global miss rate is similar to the miss rate of a single cache
  (the L2), provided the L2 cache is much bigger than L1.
• The local miss rate is NOT a good measure for secondary caches,
  as it is a function of the L1 cache.
  The global miss rate should be used.
Advanced Computer Architecture pg 33
4. Second Level Cache

Advanced Computer Architecture pg 34


5. Read Priority over Write on Miss
• Write-through with write buffers can cause RAW data hazards:

  SW 512(R0), R3    ; Mem[512] = R3   \
  LW R1, 1024(R0)   ; R1 = Mem[1024]   > 512 and 1024 map to the
  LW R2, 512(R0)    ; R2 = Mem[512]   /  same cache block

• Problem: if a write buffer is used, the final LW may read the
  wrong (old) value from memory !!

• Solution 1 : simply wait for the write buffer to empty
  – increases the read miss penalty (by 50% on the old MIPS 1000)
• Solution 2 : check write buffer contents before the read:
  if no conflicts, let the read continue

Advanced Computer Architecture pg 35


5. Read Priority over Write on Miss
What about write-back?
• Dirty bit: whenever a write is cached, this bit is set (made a 1)
  to tell the cache controller: "when you decide to re-use this cache
  line for a different address, you need to write the current
  contents back to memory"
What if read miss:
• Normal: write the dirty block to memory, then do the read
• Instead: copy the dirty block to a write buffer, then do the read,
  then the write
• Fewer CPU stalls, since the CPU restarts as soon as the read is done
Advanced Computer Architecture pg 36
Improving Cache Performance
1. Reduce the miss rate
2. Reduce the miss penalty
3. Reduce the time to hit in the cache

Advanced Computer Architecture pg 37


6. No address translation during cache access

Advanced Computer Architecture pg 38


11 Advanced Cache Optimizations (2.2)
• Reducing hit time
  1. Small and simple caches
  2. Way prediction
  3. Trace caches
• Increasing cache bandwidth
  4. Pipelined caches
  5. Multibanked caches
  6. Nonblocking caches
• Reducing miss penalty
  7. Critical word first
  8. Merging write buffers
• Reducing miss rate
  9. Compiler optimizations
• Reducing miss penalty or miss rate via parallelism
  10. Hardware prefetching
  11. Compiler prefetching
Advanced Computer Architecture pg 39
1. Small and simple first level caches
• Critical timing path:
–addressing tag memory, then
–comparing tags, then
–selecting correct set

• Direct-mapped caches can overlap tag


compare and transmission of data

• Lower associativity reduces power because


–fewer cache lines are accessed, and
–less complex mux to select the right way

Advanced Computer Architecture pg 40


Recap: 4-Way Set-Associative Cache
[Figure: same 4-way set-associative organization as before: 22-bit
tag, 8-bit index selecting one of 256 sets, parallel tag compare
across the four ways, and a 4-to-1 multiplexor on the data]

Advanced Computer Architecture pg 41


L1 Size and Associativity

Access time vs. size and associativity

Advanced Computer Architecture pg 42


L1 Size and Associativity

Energy per read vs. size and associativity

Advanced Computer Architecture pg 43


2. Fast Hit via Way Prediction
• Make set-associative caches faster
• Keep extra bits in the cache to predict the "way" (block within
  the set) of the next cache access
  – the multiplexor is set early to select the desired block; only
    1 tag comparison is performed
  – on a way miss, the other blocks are checked for matches in the
    next clock cycle
• Accuracy ≈ 85%
• Also saves energy
• Drawback: pipelining the CPU is hard if a hit takes 1 or 2 cycles

[Figure: a predicted-way hit takes the normal hit time; a way
mispredict adds an extra hit time before the miss penalty]
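
A hypothetical sketch of the idea (valid bits and data are omitted
for brevity, and the geometry is an assumption, not from the slide):

  #include <stdint.h>

  #define WAYS 4
  #define SETS 256

  typedef struct {
      uint32_t tag[SETS][WAYS];
      uint8_t  predicted_way[SETS];  /* extra prediction bits per set */
  } Cache;

  /* returns the hitting way, or -1 on a miss */
  int lookup(Cache *c, uint32_t set, uint32_t tag) {
      int w = c->predicted_way[set];
      if (c->tag[set][w] == tag)           /* fast hit: 1 comparison */
          return w;
      for (int i = 0; i < WAYS; i++)       /* way miss: check the rest */
          if (i != w && c->tag[set][i] == tag) {
              c->predicted_way[set] = i;   /* retrain the predictor */
              return i;                    /* costs an extra cycle in HW */
          }
      return -1;                           /* real miss */
  }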

Advanced Computer Architecture pg 44


Way Predicting Instruction Cache
(Alpha 21264-like)
[Figure: the PC plus a predicted way index the primary instruction
cache; the predicted way is supplied by a sequential-way predictor
or a branch-target-way predictor, and the jump/add logic selects the
next PC (PC+4 or the jump target)]

Advanced Computer Architecture pg 45


3. Fast (Inst. Cache) Hit via Trace Cache
Key Idea: Pack multiple non-contiguous basic blocks
into one contiguous trace cache line

[Figure: a dynamic instruction trace crossing three branches (BR) is
packed into one contiguous trace cache line]

• A single fetch brings in multiple basic blocks
• The trace cache is indexed by the start address and the next n
  branch predictions

Advanced Computer Architecture pg 46


3. Fast Hit times via Trace Cache
• Trace cache in Pentium 4 and its successors
  – dynamic instruction traces cached (in the level 1 cache)
  – caches micro-ops rather than x86 instructions
    • decode/translate from x86 to micro-ops on a trace cache miss
+ better utilizes long blocks (don't exit in the middle of a block,
  don't enter at a label in the middle of a block)
- complicated address mapping, since addresses are no longer aligned
  to power-of-2 multiples of the word size
- instructions may appear multiple times in multiple dynamic traces
  due to different branch outcomes
Advanced Computer Architecture pg 47
4. Pipelining Cache
• Pipeline cache access to improve bandwidth
– Examples:
• Pentium: 1 cycle
• Pentium Pro – Pentium III: 2 cycles
• Pentium 4 – Core i7: 4 cycles

• Increases branch mis-prediction penalty


• Makes it easier to increase associativity

Advanced Computer Architecture pg 48


5. Multi-banked Caches
• Organize cache as independent banks to
support simultaneous access
– ARM Cortex-A8 supports 1-4 banks for L2
– Intel i7 supports 4 banks for L1 and 8 banks for
L2

• Interleave banks according to block address

Advanced Computer Architecture pg 49


5. Multi-banked caches
• Banking works best when accesses naturally spread themselves
  across banks; the mapping of addresses to banks affects the
  behavior of the memory system

• A simple mapping that works well is "sequential interleaving"
  – spread block addresses sequentially across banks
  – e.g., with 4 banks:
    • bank 0 has all blocks with address % 4 = 0
    • bank 1 has all blocks with address % 4 = 1; …
      (see the sketch below)
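
A one-line bank selector, as a sketch (4 banks assumed, as in the
example; a power of two, so the modulo is a simple mask):

  #include <stdint.h>
  #include <stdio.h>

  #define NUM_BANKS 4

  int main(void) {
      for (uint32_t block_addr = 0; block_addr < 8; block_addr++)
          printf("block %u -> bank %u\n",
                 block_addr, block_addr % NUM_BANKS);  /* == addr & 3 */
      return 0;
  }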

Advanced Computer Architecture pg 50


6. Nonblocking Caches
• Allow hits before previous misses complete
– “Hit under miss”
– “Hit under multiple miss”
• L2 must support this
• In general, processors can hide L1 miss penalty but not
L2 miss penalty

• Requires OoO processor


• Makes cache control much more complex

Advanced Computer Architecture pg 51


Non-blocking cache

Advanced Computer Architecture pg 52


7. Critical Word First, Early Restart
• Critical word first
  – request the missed word from memory first
  – send it to the processor as soon as it arrives
• Early restart
  – request words in normal order
  – send the missed word to the processor as soon as it arrives

• The effectiveness of these strategies depends on the block size
  and the likelihood of another access to the portion of the block
  that has not yet been fetched

Advanced Computer Architecture pg 53


8. Merging Write Buffer
• When storing to a block that is already pending in the write
  buffer, update the write buffer entry
• Reduces stalls due to a full write buffer
• Does not apply to I/O addresses

[Figure: without merging, four sequential one-word writes occupy
four buffer entries; with merging they combine into one entry]

Advanced Computer Architecture pg 54


9. Compiler Optimizations
• Loop Interchange
– Swap nested loops to access memory in
sequential order

• Blocking
– Instead of accessing entire rows or columns,
subdivide matrices into blocks
– Requires more memory accesses but improves
locality of accesses

Advanced Computer Architecture pg 55


9. Reducing Misses by Compiler Optimizations
• Instructions
– Reorder procedures in memory so as to reduce
conflict misses
– Profiling to look at conflicts (using developed tools)
• Data
– Merging Arrays: improve spatial locality by single
array of compound elements vs. 2 arrays
– Loop Interchange: change nesting of loops to access
data in order stored in memory
– Loop Fusion: combine 2 independent loops that have
same looping and some variables overlap
– Blocking: Improve temporal locality by accessing
“blocks” of data repeatedly vs. going down whole
columns or rows
• Huge miss reductions possible !!
Advanced Computer Architecture pg 56
Merging Arrays
/* before: two separate arrays */
int val[SIZE];
int key[SIZE];

for (i = 0; i < SIZE; i++) {
    key[i] = newkey;
    val[i]++;
}

/* after: one array of compound elements */
struct record {
    int val;
    int key;
};
struct record records[SIZE];

for (i = 0; i < SIZE; i++) {
    records[i].key = newkey;
    records[i].val++;
}

• Reduces conflicts between val & key and improves spatial locality

Advanced Computer Architecture pg 57


Loop Interchange

/* before: column-order traversal strides through array X */
for (col = 0; col < 100; col++)
    for (row = 0; row < 5000; row++)
        X[row][col] = X[row][col+1];

/* after: row-order traversal matches the storage order */
for (row = 0; row < 5000; row++)
    for (col = 0; col < 100; col++)
        X[row][col] = X[row][col+1];

• Sequential accesses instead of striding through memory every
  100 words
• Improves spatial locality
Advanced Computer Architecture pg 58
Loop Fusion
/* before: two separate loop nests */
for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
        a[i][j] = 1/b[i][j] * c[i][j];
for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
        d[i][j] = a[i][j] + c[i][j];

/* after: fused loops; the second reference to a[i][j] can come
   directly from a register */
for (i = 0; i < N; i++)
    for (j = 0; j < N; j++) {
        a[i][j] = 1/b[i][j] * c[i][j];
        d[i][j] = a[i][j] + c[i][j];
    }

Split loops: every access to a and c misses. Fused loops: only the
1st access misses. Improves temporal locality.
Advanced Computer Architecture pg 59
Blocking (Tiling) applied to array
multiplication
for (i = 0; i < N; i++)
    for (j = 0; j < N; j++) {
        c[i][j] = 0.0;
        for (k = 0; k < N; k++)
            c[i][j] += a[i][k] * b[k][j];
    }

[Figure: c = a x b]

• The two inner loops:
  – read all NxN elements of b
  – read all N elements of one row of a repeatedly
  – write all N elements of one row of c
• If a whole matrix does not fit in the cache, many cache misses
  result
• Idea: compute on a BxB submatrix that fits in the cache
Advanced Computer Architecture pg 60
Blocking Example

#define min(a, b) ((a) < (b) ? (a) : (b))

for (ii = 0; ii < N; ii += B)
    for (jj = 0; jj < N; jj += B)
        for (i = ii; i < min(ii+B, N); i++)
            for (j = jj; j < min(jj+B, N); j++) {
                c[i][j] = 0.0;
                for (k = 0; k < N; k++)
                    c[i][j] += a[i][k] * b[k][j];
            }

[Figure: c = a x b, computed block by block]

• B is called the blocking factor
• Can reduce capacity misses from 2N³ + N² to 2N³/B + N²

Advanced Computer Architecture pg 61


Reducing Conflict Misses by Blocking
• Conflict misses in caches vs. blocking size
  – Lam et al [1991]: a blocking factor of 24 had a fifth the misses
    of a factor of 48, despite both fitting in the cache

[Figure: miss rate (0-0.15) vs. blocking factor (0-150) for a
direct-mapped cache vs. a fully associative cache]

Advanced Computer Architecture pg 62
Summary of Compiler Optimizations to
Reduce Cache Misses (by hand)

[Figure: performance improvement (1x to 3x) from the four
optimizations (merged arrays, loop interchange, loop fusion,
blocking) on compress, cholesky (nasa7), spice, mxm (nasa7),
btrix (nasa7), tomcatv, gmty (nasa7) and vpenta (nasa7)]

Advanced Computer Architecture pg 63


10. Hardware Data Prefetching
• Prefetch-on-miss:
  – prefetch block (b + 1) upon a miss on b

• One Block Lookahead (OBL) scheme
  – initiate a prefetch for block (b + 1) when block b is accessed
  – why is this different from doubling the block size?
  – can be extended to N-block lookahead

• Strided prefetch
  – if the observed sequence of accesses is block b, b+N, b+2N, then
    prefetch b+3N, etc. (see the sketch below)

• Example: IBM Power 5 [2003] supports eight independent streams of
  strided prefetch per processor, prefetching 12 lines ahead of the
  current access

• Note: instructions are usually prefetched into an instruction buffer
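
A hypothetical sketch of stride detection (the structure and names
are illustrative, not a real prefetcher):

  #include <stdint.h>
  #include <stdio.h>

  typedef struct {
      int64_t last_block;
      int64_t stride;
      int     confirmed;   /* same stride seen twice in a row */
  } StridePredictor;

  void access_block(StridePredictor *p, int64_t block) {
      int64_t new_stride = block - p->last_block;
      p->confirmed  = (new_stride == p->stride);
      p->stride     = new_stride;
      p->last_block = block;
      if (p->confirmed)    /* pattern b, b+N, b+2N detected */
          printf("access %lld -> prefetch %lld\n",
                 (long long)block, (long long)(block + p->stride));
  }

  int main(void) {
      StridePredictor p = {0, 0, 0};
      int64_t seq[] = {10, 14, 18, 22};   /* stride N = 4 */
      for (int i = 0; i < 4; i++)
          access_block(&p, seq[i]);       /* prefetches 22, then 26 */
      return 0;
  }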


Advanced Computer Architecture pg 64
10. Hardware Prefetching
• Fetch two blocks on miss (include next
sequential block)

Pentium 4 Pre-fetching
Advanced Computer Architecture pg 65
Issues in HW Prefetching
• Usefulness – should produce hits
  – if you are unlucky, the prefetched data/instructions are not needed
• Timeliness – not too late and not too early
• Cache and bandwidth pollution

[Figure: prefetched data flows from the unified L2 into the L1
instruction and data caches alongside demand fetches]

Advanced Computer Architecture pg 66


Issues in HW prefetching: stream buffer
• Instruction prefetch in Alpha AXP 21064
– Fetch two blocks on a miss; the requested block (i)
and the next consecutive block (i+1)
– Requested block placed in cache, and next block in
instruction stream buffer
– If miss in cache but hit in stream buffer, move
stream buffer block into cache and prefetch next
block (i+2)
[Figure: on a miss, the requested block goes from the unified L2 into
the L1 instruction cache, while the prefetched next block goes into
the stream buffer]
Advanced Computer Architecture pg 67
11. Compiler Prefetching
• Insert prefetch instructions before data is needed
• Non-faulting: prefetch doesn’t cause exceptions

• Register prefetch
– Loads data into register
• Cache prefetch
– Loads data into cache

• Combine with loop unrolling and software pipelining

• Cost of prefetching: more bandwidth (speculation) !!
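
A sketch of cache prefetching with GCC/Clang's __builtin_prefetch
(the prefetch distance of 16 elements is an illustrative assumption):

  #include <stddef.h>

  void scale(double *a, size_t n, double k) {
      for (size_t i = 0; i < n; i++) {
          if (i + 16 < n)   /* non-faulting hint: read, high locality */
              __builtin_prefetch(&a[i + 16], 0, 3);
          a[i] *= k;
      }
  }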

Advanced Computer Architecture pg 68


Technique (+ = improves, – = hurts; complexity from 0 = trivial to 3 = high)

Small and simple caches             + hit time, – miss rate, complexity 0
                                    Trivial; widely used
Way-predicting caches               + hit time, complexity 1
                                    Used in Pentium 4
Trace caches                        + hit time, complexity 3
                                    Used in Pentium 4
Pipelined cache access              – hit time, + bandwidth, complexity 1
                                    Widely used
Nonblocking caches                  + bandwidth, + miss penalty, complexity 3
                                    Widely used
Banked caches                       + bandwidth, complexity 1
                                    Used in L2 of Opteron and Niagara
Critical word first, early restart  + miss penalty, complexity 2
                                    Widely used
Merging write buffer                + miss penalty, complexity 1
                                    Widely used with write through
Compiler techniques to reduce
cache misses                        + miss rate, complexity 0
                                    Software is a challenge; some computers
                                    have a compiler option
Hardware prefetching of
instructions and data               + miss penalty, + miss rate,
                                    complexity 2 (instr.) / 3 (data)
                                    Many prefetch instructions;
                                    AMD Opteron prefetches data
Compiler-controlled prefetching     + miss penalty, + miss rate, complexity 3
                                    Needs a nonblocking cache; in many CPUs

Advanced Computer Architecture pg 69


Memory Technology
• Performance metrics
  – latency is the concern of the cache
  – bandwidth is the concern of multiprocessors and I/O
  – access time: time between a read request and when the desired
    word arrives
  – cycle time: minimum time between unrelated requests to memory

• DRAM is used for main memory, SRAM for caches

Advanced Computer Architecture pg 70


Memory Technology
• SRAM
– Requires low power to retain bit
– Requires 6 transistors/bit

• DRAM
– Must be re-written after being read
– Must also be periodically refreshed
  • every ~8 ms
  • an entire row is refreshed at once
– One transistor/bit
– Address lines are multiplexed:
• Upper half of address: row access strobe (RAS)
• Lower half of address: column access strobe (CAS)
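
A sketch of this address multiplexing (the 14-bit row / 10-bit column
split is hypothetical; real parts differ):

  #include <stdint.h>
  #include <stdio.h>

  #define ROW_BITS 14
  #define COL_BITS 10

  int main(void) {
      uint32_t dram_addr = 0x00ABCDE & ((1u << (ROW_BITS + COL_BITS)) - 1);
      uint32_t row = dram_addr >> COL_BITS;              /* sent with RAS */
      uint32_t col = dram_addr & ((1u << COL_BITS) - 1); /* sent with CAS */
      printf("row=0x%x col=0x%x\n", row, col);
      return 0;
  }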

Advanced Computer Architecture pg 71


Memory Technology
• Amdahl:
  – memory capacity should grow linearly with processor speed
  – unfortunately, memory capacity and speed have not kept pace
    with processors

• Some optimizations:
– Multiple accesses to same row
– Synchronous DRAM
• Added clock to DRAM interface
• Burst mode with critical word first
– Wider interfaces
– Double data rate (DDR)
– Multiple banks on each DRAM device

Advanced Computer Architecture pg 72


SRAM vs DRAM
Static RAM (SRAM)
► Bitlines driven by transistors
  – fast (~10x faster than DRAM)
► 6 transistors per bit
  – large cell (~6-10x the DRAM cell area)

Dynamic RAM (DRAM)
► A bit is stored as charge on a capacitor
► 1 transistor and 1 capacitor per bit
► The bit cell loses charge over time (read operation and circuit
  leakage)
  – must be periodically refreshed
  – hence the name Dynamic RAM

Credits: J.Leverich, Stanford Advanced Computer Architecture pg 73


DRAM: Internal architecture

[Figure: DRAM internals: address register, row decoder, memory array
per bank (banks 1-4), sense amplifiers (row buffer), column decoder,
data pins]

• Bit cells are arranged to form a memory array
• Multiple arrays are organized as different banks
  – typical numbers of banks are 4, 8 and 16
• The MS address bits select the row, the LS bits the column
• Sense amplifiers raise the voltage level on the bitlines to read
  the data out

Credits: J.Leverich, Stanford Advanced Computer Architecture pg 74


Memory Optimizations

Advanced Computer Architecture pg 75


Memory Optimizations

Advanced Computer Architecture pg 76


Memory Optimizations
• DDR:
– DDR2
• Lower power (2.5 V -> 1.8 V)
• Higher clock rates (266 MHz, 333 MHz, 400 MHz)
– DDR3
• 1.5 V
• 800 MHz
– DDR4
• 1-1.2 V
• 1600 MHz

• GDDR5 is graphics memory based on DDR3

Advanced Computer Architecture pg 77


Memory Optimizations
• Graphics memory:
  – achieves 2-5x bandwidth per DRAM vs. DDR3
    • wider interfaces (32 vs. 16 bit)
    • higher clock rate
  – possible because they are attached via soldering instead of
    socketed DIMM modules

• Reducing power in SDRAMs:
  – lower voltage
  – low power mode (ignores clock, continues to refresh)

Advanced Computer Architecture pg 78


Memory Power Consumption

Advanced Computer Architecture pg 79


Flash Memory
• Type of EEPROM
–(Electrical Erasable Programmable Read Only
Memory)
• Must be erased (in blocks) before being
overwritten
• Non-volatile
• Limited number of write cycles
• Cheaper than SDRAM, more expensive than disk
• Slower than SDRAM, faster than disk

Advanced Computer Architecture pg 80


Memory Dependability
• Memory is susceptible to cosmic rays
• Soft errors: dynamic errors
– Detected and fixed by error correcting codes (ECC)
• Hard errors: permanent errors
  – use spare rows to replace defective rows

• Chipkill: a RAID-like error recovery technique

Advanced Computer Architecture pg 81


Virtual Memory
• Protection via virtual memory
– Keeps processes in their own memory
space

• Role of architecture:
– Provide user mode and supervisor mode
– Protect certain aspects of CPU state
– Provide mechanisms for switching
between user and supervisor mode
– Provide mechanisms to limit memory
accesses
• read-only pages
• executable pages
• shared pages
– Provide TLB to translate addresses
Advanced Computer Architecture pg 82
Memory organization
• The operating system, together with the MMU hardware, takes care
  of separating the programs.
• Each program runs in its own 'virtual' environment, and uses
  logical addressing that is (often) different from the actual
  physical addresses.

• Within the virtual world of a program, the full 4 gigabyte address
  space is available. (Less under Windows.)

• In the von Neumann architecture, we need to manage the memory
  space to store the following:
  – the machine code of the program
  – the data:
    • global variables and constants
    • the stack / local variables
    • the heap

[Figure: main memory holding program + data]

Advanced Computer Architecture pg 83


Memory Organization: more detail
[Figure: memory map from address 0x00000000 (bottom) to 0xFFFFFFFF
(top)]

– Machine code (fixed size): the program itself, a set of machine
  instructions; this is in the .exe
– Global variables and constants (fixed size): initialized before
  the first line of the program is run
– Stack (variable size): the local variables in the routines; with
  each routine call, a new set of variables is put on the stack
  (stack pointer)
– Free memory
– Heap (variable size): the memory that is reserved by the memory
  manager

If the heap and the stack collide, we're out of memory.
Advanced Computer Architecture pg 84
Memory management
• Problem: many programs run simultaneously
• The MMU (Memory Management Unit) manages memory access:
  – each program thinks it owns all the memory; the CPU issues
    logical (virtual) addresses
  – the process table checks whether the requested address is
    'in core':
    • yes: translate to the physical address and access main
      memory / cache
    • no: either an access violation, or first load the 2K block
      from the swap file on the hard disk

[Figure: CPU -> MMU (process table) -> main memory organized in 2K
blocks, with a swap file on the hard disk for blocks not in memory]

Advanced Computer Architecture pg 85


Virtual Memory
• Main memory can act as a cache for the secondary storage (disk)

[Figure: virtual addresses map via address translation to physical
addresses in memory, or to disk addresses]

Advantages:
 illusion of having more physical memory
 program relocation
 protection
Advanced Computer Architecture pg 86
Advanced Computer Architecture pg 86
Pages: virtual memory blocks
• Page faults: the data is not in memory, so retrieve it from disk
  – huge miss penalty, thus pages should be fairly large (e.g., 4 KB)
  – reducing page faults is important (LRU is worth the price)
  – the faults can be handled in software instead of hardware
  – write-through is too expensive, so we use write-back

Virtual address: virtual page number (bits 31-12) + page offset
(bits 11-0); translation maps it to a physical address: physical
page number (bits 29-12) + the unchanged page offset
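
A sketch of this split for 4 KB pages (the example address and the
mapping to physical page 0x12345 are made up for illustration):

  #include <stdint.h>
  #include <stdio.h>

  #define PAGE_SHIFT 12                       /* 4 KB pages */
  #define PAGE_MASK  ((1u << PAGE_SHIFT) - 1)

  int main(void) {
      uint32_t vaddr  = 0x0040321C;
      uint32_t vpn    = vaddr >> PAGE_SHIFT;   /* virtual page number */
      uint32_t offset = vaddr & PAGE_MASK;     /* unchanged by translation */

      uint32_t ppn    = 0x12345;               /* from the page table */
      uint32_t paddr  = (ppn << PAGE_SHIFT) | offset;

      printf("vpn=0x%x offset=0x%x paddr=0x%x\n", vpn, offset, paddr);
      return 0;
  }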

Advanced Computer Architecture pg 87


Page Tables
[Figure: the page table is indexed by the virtual page number; a
valid entry (valid bit = 1) holds a physical page number, an invalid
entry (valid bit = 0) refers to the page's address in disk storage]

Advanced Computer Architecture pg 88


Page Tables
[Figure: the page table register points to the page table in memory.
The 20-bit virtual page number (bits 31-12) indexes the table; the
12-bit page offset passes through. A valid entry supplies the 18-bit
physical page number, which is concatenated with the page offset to
form the physical address; valid = 0 means the page is not present
in memory]

Advanced Computer Architecture pg 89


Size of page table
• Assume
– 40-bit virtual address; 32-bit physical
– 4 Kbyte pages; 4 bytes per page table entry (PTE)


Solution
 Size = Nentries * size of entry = 2^40 / 2^12 * 4 bytes
   = 2^28 * 4 bytes = 1 GByte


Reduce size:
 Dynamic allocation of page table entries
 Hashing: inverted page table
 1 entry per physical available instead of virtual page
 Page the page table itself (i.e. part of it can be on disk)
 Use larger page size (multiple page sizes)
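
The arithmetic above, as a one-screen sketch:

  #include <stdio.h>

  int main(void) {
      /* 40-bit virtual address, 4 KB pages, 4-byte PTEs */
      unsigned long long entries = 1ULL << (40 - 12);  /* 2^28 pages */
      unsigned long long bytes   = entries * 4;        /* 2^30 bytes */
      printf("%llu entries, %llu GByte\n", entries, bytes >> 30);
      return 0;
  }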

Advanced Computer Architecture pg 90


Fast Translation Using a TLB
• Address translation would appear to require extra memory references
  – one to access the PTE (page table entry)
  – then the actual memory access

• However, access to page tables has good locality
  – so use a fast cache of PTEs within the CPU
  – called a Translation Look-aside Buffer (TLB)
  – typical: 16-512 PTEs, 0.5-1 cycle for a hit, 10-100 cycles for a
    miss, 0.01%-1% miss rate
  – misses can be handled by hardware or software

Advanced Computer Architecture pg 91


Making Address Translation Fast
• A cache for address translations: the translation lookaside
  buffer (TLB)

[Figure: the TLB caches page table entries (valid bit, tag = virtual
page number, physical page address); on a TLB miss the full page
table is consulted, whose valid entries point to physical memory and
invalid entries to disk storage]

Advanced Computer Architecture pg 92


TLBs and caches
[Flowchart: virtual address -> TLB access.
 - TLB miss: TLB miss exception.
 - TLB hit: physical address.
   - Read: try to read the data from the cache; on a cache miss,
     stall; on a cache hit, deliver the data to the CPU.
   - Write: check the write access bit; if off, write protection
     exception; if on, write the data into the cache, update the tag,
     and put the data and the address into the write buffer]

Advanced Computer Architecture pg 93


Overall operation of memory hierarchy
• Each instruction or data access can result in three types of
  hits/misses: TLB, page table, cache
• Q: which combinations are possible?
  Check them all! (see fig 5.26)

TLB    Page table   Cache   Possible?
hit    hit          hit     yes: that's what we want
hit    hit          miss    yes, but the page table is not checked
                            on a TLB hit
hit    miss         hit     no
hit    miss         miss    no
miss   hit          hit
miss   hit          miss
miss   miss         hit     no
miss   miss         miss
Advanced Computer Architecture pg 94
ARM Cortex-A8 data caches/TLB.
Since the instruction and data hierarchies are symmetric, we show
only one. The TLB (instruction or data) is fully associative with
32 entries. The L1 cache is four-way set associative with 64-byte
blocks and 32 KB capacity. The L2 cache is eight-way set associative
with 64-byte blocks and 1 MB capacity. This figure doesn't show the
valid bits and protection bits for the caches and TLB, nor the use
of the way prediction bits that would dictate the predicted bank of
the L1 cache.
Advanced Computer Architecture pg 95
Intel Nehalem (i7)
• Die: 13.5 x 19.6 mm, 731 Mtransistors
• Per core:
  – 32-KB instruction & 32-KB data L1 caches
  – 512 KB L2
  – 2-level TLB
• Shared:
  – 8 MB L3
  – 2 x 128-bit DDR3 channels

Advanced Computer Architecture pg 96


The Intel i7 memory hierarchy
The steps in both instruction and data access. We show only reads
for data. Writes are similar, in that they begin with a read (since
caches are write back). Misses are handled by simply placing the
data in a write buffer, since the L1 cache is not write allocated.

Advanced Computer Architecture pg 97


Address translation and TLBs

Advanced Computer Architecture pg 98


Cache L1-L2-L3 organization

Advanced Computer Architecture pg 99


Virtual Machines
• Supports isolation and security
• Sharing a computer among many unrelated users
• Enabled by the raw speed of processors, which makes the overhead
  more acceptable

• Allows different operating systems to be presented to user programs
  – "System Virtual Machines"
  – the SVM software is called "virtual machine monitor" or
    "hypervisor"
  – individual virtual machines running under the monitor are called
    "guest VMs"

Advanced Computer Architecture pg 100


Impact of VMs on Virtual Memory
• Each guest OS maintains its own set of page
tables
– VMM adds a level of memory between physical and
virtual memory called “real memory”
– VMM maintains shadow page table that maps guest
virtual addresses to physical addresses
• Requires VMM to detect guest’s changes to its own page
table
• Occurs naturally if accessing the page table pointer is a
privileged operation

Advanced Computer Architecture pg 101
