
Memory Hierarchy Design

Contents
1. Memory hierarchy
1. Basic concepts
2. Design techniques
2. Caches
1. Types of caches: Fully associative, Direct mapped, Set associative
2. Ten optimization techniques
3. Main memory
1. Memory technology
2. Memory optimization
3. Power consumption
4. Memory hierarchy case studies: Opteron, Pentium, i7.
5. Virtual memory
6. Problem solving

Introduction
 Programmers want very large memory with low latency
 Fast memory technology is more expensive per bit than slower memory
 Solution: organize the memory system as a hierarchy
  the entire addressable memory space is available in the largest, slowest memory
  incrementally smaller and faster memories, each containing a subset of the memory below it, proceed in steps up toward the processor
 Temporal and spatial locality ensure that nearly all references can be found in the smaller memories
  gives the illusion of a large, fast memory being presented to the processor



Memory hierarchy

Processor  L1 Cache  L2 Cache  L3 Cache  Main Memory  Hard Drive or Flash

Latency increases and capacity grows (KB, MB, GB, TB) moving down the hierarchy, away from the processor.
PROCESSOR
L1: I-Cache, D-Cache
L2: U-Cache
L3: U-Cache
Main Memory

I-Cache  instruction cache
D-Cache  data cache
U-Cache  unified cache

Different functional units fetch information from the I-cache and the D-cache: the decoder and scheduler operate with the I-cache, while the integer execution unit and the floating-point unit communicate with the D-cache.


Introduction
Memory hierarchy

 Example: my PowerBook
  Intel Core i7, 2 cores, 2.8 GHz
  L2 cache: 256 KB/core
  L3 cache: 4 MB
  Main memory: 16 GB  two DDR3 8 GB modules at 1.6 GHz
  Disk: 500 GB


Introduction
Processor/memory cost-performance gap



Introduction
Memory hierarchy design

 Memory hierarchy design becomes more crucial with recent multi-core processors
 Aggregate peak bandwidth grows with the number of cores:
  Intel Core i7 can generate two references per core per clock
  with four cores and a 3.2 GHz clock:
   12.8 billion (4 cores x 3.2 GHz) 128-bit instruction references/second +
   25.6 billion (2 x 4 cores x 3.2 GHz) 64-bit data references/second
   = 409.6 GB/s!
  DRAM bandwidth is only 6% of this (25 GB/s)
 Requires:
  multi-port, pipelined caches
  two levels of cache per core
  a shared third-level cache on chip
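A tiny C check of the peak-bandwidth arithmetic above (the core count, clock rate, and reference widths are taken from the bullets; the program only illustrates the calculation):

#include <stdio.h>

int main(void) {
    double cores = 4, clock_ghz = 3.2;
    /* references per second, in billions */
    double inst_refs = cores * clock_ghz;        /* 12.8 G 128-bit instruction refs */
    double data_refs = 2 * cores * clock_ghz;    /* 25.6 G  64-bit data refs        */
    /* 128 bits = 16 bytes, 64 bits = 8 bytes */
    double gb_per_s = inst_refs * 16 + data_refs * 8;
    printf("aggregate peak demand: %.1f GB/s\n", gb_per_s);   /* 409.6 GB/s */
    return 0;
}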


Introduction
Performance and power

 High-end microprocessors have >10 MB of on-chip cache
 The cache consumes a large fraction of the chip area and power budget


Introduction
Memory hierarchy basics

 When a word is not found in the cache, a miss occurs
 On a miss, fetch the word from the lower level in the hierarchy
  higher-latency reference
  the lower level may be another cache or the main memory
  fetch the entire block consisting of several words
   takes advantage of spatial locality
  place the block into the cache, in any location within its set, determined by the address:
   (block address) MOD (number of sets)


Placement problem

(Figure: blocks of the large main memory must be mapped into the much smaller cache memory.)


Placement policies
 Main memory has a much larger capacity than the cache
 A placement policy defines the mapping between main memory and the cache
  i.e., where to put a block in the cache


Fully associative cache

(Figure: memory blocks 031 mapping to cache blocks 07.)

A block can be placed in any location in the cache.
Direct mapped cache

(Figure: memory blocks 031 mapping to cache blocks 07.)

A block can be placed in ONLY a single location in the cache:

(Block address) MOD (Number of blocks in cache)

Example: block 12 maps to cache block 12 MOD 8 = 4.
Set associative cache

(Figure: memory blocks 031 mapping to a cache organized as 4 sets of 2 blocks each.)

A block can be placed in one of n locations in an n-way set associative cache:

(Block address) MOD (Number of sets in cache)

Example: block 12 maps to set 12 MOD 4 = 0.
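A minimal C sketch of the two placement computations above; the 8-block cache and the 4-set, 2-way geometry are the ones shown in the figures, and the macro names are illustrative:

#include <stdio.h>
#include <stdint.h>

#define NUM_BLOCKS 8   /* direct mapped: 8 one-block sets            */
#define NUM_SETS   4   /* 2-way set associative: 4 sets of 2 blocks  */

int main(void) {
    uint32_t block_address = 12;

    /* Direct mapped: exactly one possible cache block. */
    uint32_t dm_block = block_address % NUM_BLOCKS;   /* 12 MOD 8 = 4 */

    /* Set associative: one possible set, any way within it. */
    uint32_t sa_set = block_address % NUM_SETS;       /* 12 MOD 4 = 0 */

    printf("direct mapped: block %u -> cache block %u\n", block_address, dm_block);
    printf("set associative: block %u -> set %u (either way of the set)\n", block_address, sa_set);
    return 0;
}

A fully associative cache needs no index computation at all: the block may go into any of the 8 cache blocks, so every tag must be compared on a lookup.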
Introduction
Memory hierarchy basics

 n blocks per set => n-way set associative
  Direct-mapped cache => one block per set
  Fully associative => one set

 Writing to the cache: two strategies
  Write-through: immediately update lower levels of the hierarchy
  Write-back: update lower levels of the hierarchy only when a modified block is replaced in the cache
 Both strategies use a write buffer to make writes asynchronous
Dirty bit
 Two types of caches
  Instruction cache: I-cache
  Data cache: D-cache
 The dirty bit indicates whether the cache block has been written to (modified)
 No dirty bit is needed for
  I-caches
  write-through D-caches
 A dirty bit is needed for
  write-back D-caches
Write back

(Figure: CPU writes go to the D-cache; main memory is updated only when a dirty block is evicted.)


Write through cache

(Figure: CPU writes go to the cache and to main memory at the same time.)
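A toy C sketch contrasting the two write policies for a single cache line; the structure layout and function names are illustrative, not from the slides:

#include <stdbool.h>
#include <stdint.h>

#define BLOCK_BYTES 32

struct cache_line {
    bool     valid;
    bool     dirty;                 /* meaningful only for write-back */
    uint32_t tag;
    uint8_t  data[BLOCK_BYTES];
};

/* Stand-in for the next level of the hierarchy. */
static void write_block_to_memory(uint32_t tag, const uint8_t *data) {
    (void)tag; (void)data;
}

/* Write-through: every store updates the cache AND the lower level. */
static void store_write_through(struct cache_line *line, uint32_t offset, uint8_t value) {
    line->data[offset] = value;
    write_block_to_memory(line->tag, line->data);   /* typically buffered in a write buffer */
}

/* Write-back: a store only marks the line dirty; memory is updated at eviction. */
static void store_write_back(struct cache_line *line, uint32_t offset, uint8_t value) {
    line->data[offset] = value;
    line->dirty = true;
}

static void evict_write_back(struct cache_line *line) {
    if (line->valid && line->dirty)
        write_block_to_memory(line->tag, line->data); /* flush the modified block */
    line->valid = false;
    line->dirty = false;
}

int main(void) {
    struct cache_line line = { .valid = true, .tag = 0x12 };
    store_write_through(&line, 0, 0xAB);   /* memory updated immediately       */
    store_write_back(&line, 1, 0xCD);      /* memory updated only at eviction  */
    evict_write_back(&line);
    return 0;
}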


Cache organization
 A cache row (line) contains:
  Tag  part of the address of the data fetched from main memory
  Data block  the data fetched from main memory
  Flags: valid, dirty

 A memory address is split (MSB to LSB) into:
  tag  the most significant bits of the address
  index  selects the cache row the data has been put in
  block offset  selects the desired data within the stored data block in the cache row
Cache organization

(Figure: the CPU address is split into tag <21 bits>, index <6 bits>, and block offset <5 bits>. Each cache row stores a valid bit <1>, a tag <21>, and a data block <256 bits>. The index selects a row, a comparator checks the stored tag against the address tag, and a multiplexor selects the requested word from the data block.)
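A small C sketch of the address split, using the bit widths from the figure (a 32-bit address with a 21-bit tag, 6-bit index, and 5-bit block offset); the constant names and the example address are illustrative:

#include <stdio.h>
#include <stdint.h>

#define OFFSET_BITS 5    /* 32-byte blocks                    */
#define INDEX_BITS  6    /* 64 cache rows                     */
/* the remaining 21 bits of the 32-bit address form the tag   */

int main(void) {
    uint32_t addr   = 0xDEADBEEF;                        /* example address */
    uint32_t offset = addr & ((1u << OFFSET_BITS) - 1);
    uint32_t index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
    uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);

    printf("tag = 0x%x, index = %u, offset = %u\n", tag, index, offset);
    return 0;
}

On a lookup, the index selects the row, the stored tag is compared with the address tag (and the valid bit is checked), and the offset selects the word within the 256-bit data block.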


Introduction
Cache misses
 Miss rate  fraction of cache accesses that result in a miss

 Causes of misses
  Compulsory  the first reference to a block
  Capacity  blocks discarded because the cache is full and later retrieved
  Conflict  the program makes repeated references to multiple addresses from different blocks that map to the same location in the cache


Introduction
Cache misses

 Speculative and multithreaded processors may execute other instructions during a miss
  reduces the performance impact of misses


Introduction
Basic cache optimizations techniques
 Larger block size
 Reduces compulsory misses
 Increases capacity and conflict misses, increases miss penalty
 Larger total cache capacity to reduce miss rate
 Increases hit time, increases power consumption
 Higher associativity
 Reduces conflict misses
 Increases hit time, increases power consumption
 Higher number of cache levels
 Reduces overall memory access time
 Give priority to read misses over writes
 Reduces miss penalty
 Avoid address translation in cache indexing
 Reduces hit time



Advanced Optimizations
Advanced optimizations

 The ten optimizations are grouped by goal:
  reducing the hit time
  increasing cache bandwidth
  reducing the miss penalty
  reducing the miss rate
  reducing the miss penalty or miss rate via parallelism


Advanced Optimizations
Ten advanced optimizations



1) Fast hit times via small and simple L1 caches

 Critical timing path:
  addressing the tag memory, then
  comparing tags, then
  selecting the correct set
 Direct-mapped caches can overlap tag compare and transmission of data
 Lower associativity reduces power because fewer cache lines are accessed


Advanced Optimizations
L1 size and associativity

Access time vs. size and associativity



Advanced Optimizations
L1 size and associativity

Energy per read vs. size and associativity



Advanced Optimizations
2) Fast hit times via way prediction
 How to combine the fast hit time of a direct-mapped cache with the lower conflict misses of a 2-way set-associative cache?
 Way prediction: keep extra bits in the cache to predict the way (block within the set) of the next cache access
  the multiplexor is set early to select the predicted block; only 1 tag comparison is performed that clock cycle, in parallel with reading the cache data
  on a mis-prediction, check the other blocks for matches in the next clock cycle
 Drawback: the CPU pipeline is harder to design if a hit can take 1 or 2 cycles
 Prediction accuracy
  > 90% for two-way
  > 80% for four-way
  I-cache has better accuracy than D-cache
 First used on the MIPS R10000 in the mid-90s; also used on the ARM Cortex-A8
 Extension  way selection: use the prediction to decide which block to access, not just which tag to compare; increases the mis-prediction penalty
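A rough C sketch of way prediction for a 2-way set-associative cache; the structure, sizes, and the 1-vs-2-cycle return convention are illustrative, not taken from any particular design:

#include <stdbool.h>
#include <stdint.h>

#define NUM_SETS 64
#define WAYS      2

struct cache_set {
    bool     valid[WAYS];
    uint32_t tag[WAYS];
    uint8_t  predicted_way;     /* the extra prediction bits kept per set */
};

static struct cache_set sets[NUM_SETS];

/* Returns 1 for a fast hit in the predicted way, 2 for a slower hit found in the
 * following cycle, 0 for a miss.  Only one tag comparison happens in the first
 * cycle, in parallel with the data read. */
static int lookup(uint32_t tag, uint32_t index) {
    struct cache_set *s = &sets[index];
    uint8_t w = s->predicted_way;

    if (s->valid[w] && s->tag[w] == tag)
        return 1;                                  /* prediction correct */

    for (uint8_t other = 0; other < WAYS; other++) {
        if (other != w && s->valid[other] && s->tag[other] == tag) {
            s->predicted_way = other;              /* retrain the predictor */
            return 2;                              /* hit, but one cycle later */
        }
    }
    return 0;                                      /* miss */
}

int main(void) {
    sets[3].valid[1] = true;
    sets[3].tag[1] = 0x42;
    return lookup(0x42, 3);     /* mis-predicts the way, hits in the next cycle */
}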


Advanced Optimizations
3) Increase cache bandwidth by pipelining
 Pipelining the cache improves bandwidth but increases latency
 More clock cycles between the issue of the load and the use of the data
 Examples:
  Pentium: 1 cycle
  Pentium Pro  Pentium III: 2 cycles
  Pentium 4  Core i7: 4 cycles
 Increases the branch mis-prediction penalty
 Makes it easier to increase associativity


4) Increase cache bandwidth: non-blocking caches
 Pipelined processors that allow out-of-order execution should not stall during a data cache miss
 A non-blocking (lockup-free) cache allows the data cache to continue to supply hits during a miss
  requires additional bits on registers or out-of-order execution
  requires multi-bank memories
 Hit under miss reduces the effective miss penalty by continuing to work during the miss instead of ignoring CPU requests
 Hit under multiple miss (miss under miss) may further lower the effective miss penalty by overlapping multiple misses
  significantly increases the complexity of the cache controller, as there can be multiple outstanding memory accesses
  requires multiple memory banks (otherwise it cannot be supported)
 The Pentium Pro allows 4 outstanding memory misses
Advanced Optimizations
Nonblocking caches
 Like pipelining the memory system  allow hits before previous misses complete
  hit under miss
  hit under multiple miss
 Important for hiding memory latency
 The L2 cache must support this
 In general, processors can hide the L1 miss penalty but not the L2 miss penalty


http://csg.csail.mit.edu/6.S078
Advanced Optimizations
5) Independent banks; interleaving
 Organize the cache as independent banks to support simultaneous access
  ARM Cortex-A8 supports 14 banks for L2
  Intel i7 supports 4 banks for L1 and 8 banks for L2
 Interleave banks according to block address


Advanced Optimizations
6) Early restart and critical word first
 Goal: reduce the miss penalty
 Don't wait for the full block to arrive before restarting the CPU
  Early restart  as soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution
   spatial locality  the CPU tends to want the next sequential word anyway, so the benefit of early restart alone is unclear
  Critical word first  request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while the rest of the words in the block are filled in
 With the long blocks popular today, critical word first is widely used


7) Merging write buffer to reduce miss penalty
 A write buffer allows the processor to continue while waiting for the write to complete in memory
 If the buffer already contains modified blocks, the addresses can be checked to see whether the address of the new data matches the address of a valid write-buffer entry
  if so, the new data are combined with that entry
 Increases the effective block size of writes for a write-through cache when writes are to sequential words or bytes, since multiword writes are more efficient for memory
 The Sun T1 (Niagara) processor, among many others, uses write merging
Advanced Optimizations
Merging write buffer
 When storing to a block that is already pending in the write buffer, update that write-buffer entry (see the sketch below)
 Reduces stalls due to a full write buffer
 Do not apply write merging to I/O addresses

(Figure: write buffer contents without and with write merging.)
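A toy C sketch of the merging check described above; the entry layout, sizes, and function names are illustrative:

#include <stdbool.h>
#include <stdint.h>

#define WB_ENTRIES  4
#define BLOCK_BYTES 8            /* one 8-byte block image per entry */

struct wb_entry {
    bool     valid;
    uint64_t block_addr;         /* address of the aligned block     */
    uint8_t  data[BLOCK_BYTES];
    uint8_t  byte_valid;         /* bitmap of bytes holding new data */
};

static struct wb_entry write_buffer[WB_ENTRIES];

/* Merge a store into an existing entry if possible, otherwise claim a free one.
 * Returns false when the buffer is full and the processor would stall. */
static bool wb_store(uint64_t addr, uint8_t value) {
    uint64_t block  = addr & ~(uint64_t)(BLOCK_BYTES - 1);
    unsigned offset = (unsigned)(addr & (BLOCK_BYTES - 1));

    for (int i = 0; i < WB_ENTRIES; i++) {
        if (write_buffer[i].valid && write_buffer[i].block_addr == block) {
            write_buffer[i].data[offset] = value;          /* write merging */
            write_buffer[i].byte_valid |= (uint8_t)(1u << offset);
            return true;
        }
    }
    for (int i = 0; i < WB_ENTRIES; i++) {
        if (!write_buffer[i].valid) {                      /* allocate a new entry */
            write_buffer[i].valid        = true;
            write_buffer[i].block_addr   = block;
            write_buffer[i].data[offset] = value;
            write_buffer[i].byte_valid   = (uint8_t)(1u << offset);
            return true;
        }
    }
    return false;                                          /* buffer full: stall */
}

int main(void) {
    wb_store(0x1000, 0xAA);     /* allocates an entry for block 0x1000      */
    wb_store(0x1001, 0xBB);     /* merges into the same entry (same block)  */
    return 0;
}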


8) Reduce misses by compiler optimizations
 McFarling [1989] reduced cache misses by 75% on an 8 KB direct-mapped cache with 4-byte blocks, in software
 Instructions
  reorder procedures in memory so as to reduce conflict misses
  use profiling to look at conflicts (using tools they developed)
 Data
  Merging arrays: improve spatial locality by using a single array of compound elements instead of 2 arrays
  Loop interchange: change the nesting of loops to access data in the order it is stored in memory
  Loop fusion: combine 2 independent loops that have the same looping and some variables in common
  Blocking: improve temporal locality by accessing blocks of data repeatedly instead of going down whole columns or rows
Advanced Optimizations
Compiler optimizations

 Loop interchange
  swap nested loops to access memory in sequential order
 Blocking
  instead of accessing entire rows or columns, subdivide matrices into blocks
  requires more memory accesses but improves the locality of the accesses


Merging arrays example
/* Before: 2 sequential arrays */
int val[SIZE];
int key[SIZE];

/* After: 1 array of structures */


struct merge {
int val;
int key;
};
struct merge merged_array[SIZE];

Reducing conflicts between val & key improves spatial locality.
Loop interchange example
/* Before */
for (k = 0; k < 100; k = k+1)
for (j = 0; j < 100; j = j+1)
for (i = 0; i < 5000; i = i+1)
x[i][j] = 2 * x[i][j];
/* After */
for (k = 0; k < 100; k = k+1)
for (i = 0; i < 5000; i = i+1)
for (j = 0; j < 100; j = j+1)
x[i][j] = 2 * x[i][j];

Sequential accesses instead of striding through memory every 100 words; improves spatial locality.
Loop fusion example
/* Before */
for (i = 0; i < N; i = i+1)
for (j = 0; j < N; j = j+1)
a[i][j] = 1/b[i][j] * c[i][j];
for (i = 0; i < N; i = i+1)
for (j = 0; j < N; j = j+1)
d[i][j] = a[i][j] + c[i][j];
/* After */
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1)
    { a[i][j] = 1/b[i][j] * c[i][j];
      d[i][j] = a[i][j] + c[i][j]; }

Before fusion: 2 misses per access to a & c; after: one miss per access  improves temporal locality.
Blocking example
/* Before */
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1)
    { r = 0;
      for (k = 0; k < N; k = k+1)
        r = r + y[i][k]*z[k][j];
      x[i][j] = r;
    };

 Two inner loops:
  read all N x N elements of z[]
  read N elements of 1 row of y[] repeatedly
  write N elements of 1 row of x[]
 Capacity misses are a function of N and the cache size
Blocking example
/* After */
for (jj = 0; jj < N; jj = jj+B)
  for (kk = 0; kk < N; kk = kk+B)
    for (i = 0; i < N; i = i+1)
      for (j = jj; j < min(jj+B,N); j = j+1)
        { r = 0;
          for (k = kk; k < min(kk+B,N); k = k+1)
            r = r + y[i][k]*z[k][j];
          x[i][j] = x[i][j] + r;
        };

 B is called the blocking factor
 Capacity misses drop from 2N³ + N² to 2N³/B + N²
 Conflict misses too?
Snapshot of arrays x, y, z when N = 6 and i = 1

 The age of access to the array elements is indicated by shade:
  white  not yet touched
  light  older access
  dark  newer access
 In the "before" algorithm the elements of y and z are read repeatedly to calculate x. Compare with the next slide, which shows the "after" access patterns. Indices i, j, and k are shown along the rows and columns.
Reducing conflict misses by blocking

(Figure: miss rate vs. blocking factor for a direct-mapped cache and a fully associative cache.)

 Conflict misses in caches that are not fully associative, as a function of the blocking size
 Lam et al. [1991]: a blocking factor of 24 had one-fifth the misses of a factor of 48, even though both fit in the cache
Summary of compiler optimizations to reduce cache misses (by hand)

(Figure: performance improvement, roughly 1x to 3x, for vpenta (nasa7), gmty (nasa7), tomcatv, btrix (nasa7), mxm (nasa7), spice, cholesky (nasa7), and compress, broken down by merged arrays, loop interchange, loop fusion, and blocking.)
Advanced Optimizations
9) Hardware prefetching

 Fetch two blocks on a miss: the requested block and the next sequential block

(Figure: Pentium 4 pre-fetching performance.)


Advanced Optimizations
10) Compiler prefetching

 Insert prefetch instructions before the data is needed
 Non-faulting: a prefetch doesn't cause exceptions

 Register prefetch
  loads data into a register
 Cache prefetch
  loads data into the cache

 Combine with loop unrolling and software pipelining
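A small C sketch of cache prefetching in a loop. It uses the GCC/Clang __builtin_prefetch intrinsic, which the slides do not name, and the prefetch distance of 16 elements is an illustrative guess that would normally be tuned:

#include <stddef.h>

/* Sum an array, prefetching the element that will be needed a few iterations ahead. */
double sum_with_prefetch(const double *a, size_t n) {
    const size_t distance = 16;            /* illustrative prefetch distance */
    double sum = 0.0;

    for (size_t i = 0; i < n; i++) {
        if (i + distance < n)
            __builtin_prefetch(&a[i + distance], 0 /* read */, 3 /* keep in cache */);
        sum += a[i];
    }
    return sum;
}

Because each prefetch costs an instruction, it pays off only when the saved miss time exceeds the issue overhead, which is exactly the trade-off raised on the next slide.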


Reducing misses by software prefetching
 Data prefetch
  register prefetch: load data into a register (HP PA-RISC loads)
  cache prefetch: load data into the cache (MIPS IV, PowerPC, SPARC V9)
  special prefetching instructions cannot cause faults; a form of speculative execution
 Issuing prefetch instructions takes time
  is the cost of issuing prefetches < the savings in reduced misses?
  wider superscalar issue reduces the difficulty of finding issue bandwidth for prefetches
Advanced Optimizations
Summary

