08 Caches

University of Washington

Roadmap

Course topics: memory & data; integers & floats; machine code & C; x86 assembly;
procedures & stacks; arrays & structs; memory & caches; processes; virtual memory;
memory allocation; Java vs. C.

C:                                    Java:
  car *c = malloc(sizeof(car));         Car c = new Car();
  c->miles = 100;                       c.setMiles(100);
  c->gals = 17;                         c.setGals(17);
  float mpg = get_mpg(c);               float mpg = c.getMPG();
  free(c);

Assembly language:
  get_mpg:
      pushq %rbp
      movq %rsp, %rbp
      ...
      popq %rbp
      ret

Machine code:
  0111010000011000
  100011010000010000000010
  1000100111000010
  110000011111101000011111

The OS and the computer system sit beneath it all.



How does execution time grow with SIZE?

  int array[SIZE];
  int A = 0;

  for (int i = 0; i < 200000; ++i) {
      for (int j = 0; j < SIZE; ++j) {
          A += array[j];
      }
  }

(Plot: TIME vs. SIZE)

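To actually produce such a plot, one would time the loop for a range of SIZE values. A minimal sketch of such a harness, assuming a compile-time SIZE and clock() for timing (the harness is mine, not the slides'):

  /* time_loop.c: time the nested loop above for one SIZE value */
  #include <stdio.h>
  #include <time.h>

  #define SIZE 4096            /* vary this and re-run to trace out the plot */

  int array[SIZE];             /* zero-initialized global array */

  int main(void) {
      long A = 0;
      clock_t start = clock();
      for (int i = 0; i < 200000; ++i)
          for (int j = 0; j < SIZE; ++j)
              A += array[j];
      double secs = (double)(clock() - start) / CLOCKS_PER_SEC;
      printf("SIZE=%d time=%.3fs (A=%ld)\n", SIZE, secs, A);
      return 0;
  }

Printing A after the loop keeps the compiler from deleting the loop entirely.
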
Actual Data

(Plot: measured time vs. SIZE; the curve steps upward as the working set outgrows
successive cache levels.)

Making memory accesses fast!

• Cache basics
• Principle of locality
• Memory hierarchies
• Cache organization
• Program optimizations that consider caches



Problem: Processor-Memory Bottleneck

Processor performance doubled about every 18 months; bus bandwidth evolved much slower.

  CPU (Reg)  <---- bus ---->  Main Memory

Core 2 Duo, processor side: can process at least 256 bytes/cycle.
Core 2 Duo, memory side: bandwidth 2 bytes/cycle, latency 100 cycles.
(cycle = single fixed-time machine step)

Problem: lots of waiting on memory.

Problem: Processor-Memory Bottleneck

Processor performance doubled about every 18 months; bus bandwidth evolved much slower.

  CPU (Reg)  <-->  Cache  <-->  Main Memory

Core 2 Duo, processor side: can process at least 256 bytes/cycle.
Core 2 Duo, memory side: bandwidth 2 bytes/cycle, latency 100 cycles.
(cycle = single fixed-time machine step)

Solution: caches.

Cache

• English definition: a hidden storage space for provisions, weapons, and/or
  treasures.

• CSE definition: computer memory with short access time used for the storage of
  frequently or recently used instructions or data (i-cache and d-cache); more
  generally, used to optimize data transfers between system elements with different
  characteristics (network interface cache, I/O cache, etc.).



General Cache Mechanics

Cache: smaller, faster, more expensive memory; caches a subset of the blocks
(a.k.a. lines). Data is copied between levels in block-sized transfer units.

  Cache:    [ 8 ][ 9 ][ 14 ][ 3 ]

Memory: larger, slower, cheaper; viewed as partitioned into "blocks" or "lines".

  Memory:   [  0 ][  1 ][  2 ][  3 ]
            [  4 ][  5 ][  6 ][  7 ]
            [  8 ][  9 ][ 10 ][ 11 ]
            [ 12 ][ 13 ][ 14 ][ 15 ]



General Cache Concepts: Hit

Request: 14. Data in block b is needed, and block b is in the cache: hit!

  Cache:    [ 8 ][ 9 ][ 14 ][ 3 ]



General Cache Concepts: Miss

Request: 12. Data in block b is needed, but block b is not in the cache: miss!
Block b is fetched from memory (Request: 12), then stored in the cache:

  Cache:    [ 8 ][ 9 ][ 12 ][ 3 ]    (12 replaces 14)

• Placement policy: determines where b goes.
• Replacement policy: determines which block gets evicted (the victim).



Why Caches Work

• Locality: programs tend to use data and instructions with addresses near or equal
  to those they have used recently.

• Temporal locality:
  - Recently referenced items are likely to be referenced again in the near future.
  - Why is this important?

• Spatial locality:
  - Items with nearby addresses tend to be referenced close together in time.
  - How do caches take advantage of this?



Example: Locality?

  sum = 0;
  for (i = 0; i < n; i++)
      sum += a[i];
  return sum;

• Data:
  - Temporal: sum referenced in each iteration.
  - Spatial: array a[] accessed in stride-1 pattern.
• Instructions:
  - Temporal: cycle through loop repeatedly.
  - Spatial: reference instructions in sequence.

• Being able to assess the locality of code is a crucial skill for a programmer.

Locality Example #1

  int sum_array_rows(int a[M][N])
  {
      int i, j, sum = 0;

      for (i = 0; i < M; i++)
          for (j = 0; j < N; j++)
              sum += a[i][j];
      return sum;
  }

Layout (rows of a are contiguous in memory):
  a[0][0] a[0][1] a[0][2] a[0][3]
  a[1][0] a[1][1] a[1][2] a[1][3]
  a[2][0] a[2][1] a[2][2] a[2][3]

Access order: a[0][0], a[0][1], a[0][2], a[0][3], a[1][0], a[1][1], …, a[2][3].
Stride-1.

Locality Example #2

  int sum_array_cols(int a[M][N])
  {
      int i, j, sum = 0;

      for (j = 0; j < N; j++)
          for (i = 0; i < M; i++)
              sum += a[i][j];
      return sum;
  }

Access order: a[0][0], a[1][0], a[2][0], a[0][1], a[1][1], a[2][1], …, a[2][3].
Stride-N.

Locality Example #3

  int sum_array_3d(int a[M][N][N])
  {
      int i, j, k, sum = 0;

      for (i = 0; i < N; i++)
          for (j = 0; j < N; j++)
              for (k = 0; k < M; k++)
                  sum += a[k][i][j];
      return sum;
  }

• What is wrong with this code?
• How can it be fixed? (One possible answer is sketched below.)

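A hedged answer sketch (the slide leaves the fix as an exercise): the innermost loop varies k, the first index, so consecutive accesses are N*N elements apart. Reordering the loops so the last index varies fastest restores stride-1 access:

  int sum_array_3d_fixed(int a[M][N][N])
  {
      int i, j, k, sum = 0;

      /* k (first index) outermost, j (last index) innermost:
         consecutive accesses are now adjacent in memory (stride-1) */
      for (k = 0; k < M; k++)
          for (i = 0; i < N; i++)
              for (j = 0; j < N; j++)
                  sum += a[k][i][j];
      return sum;
  }
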


Cost of Cache Misses

• Huge difference between a hit and a miss
  - Could be 100x, if just L1 and main memory.

• Would you believe 99% hits is twice as good as 97%?
  - Consider: cache hit time of 1 cycle, miss penalty of 100 cycles
    (cycle = single fixed-time machine step; we check the cache every time).
  - Average access time:
    97% hits: 1 cycle + 0.03 * 100 cycles = 4 cycles
    99% hits: 1 cycle + 0.01 * 100 cycles = 2 cycles

• This is why "miss rate" is used instead of "hit rate".

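The arithmetic above is the standard average memory access time (AMAT) formula; written out in symbols (the notation is mine, not the slides'):

  AMAT = hit_time + miss_rate * miss_penalty
       = 1 + 0.03 * 100 = 4 cycles   (97% hits)
       = 1 + 0.01 * 100 = 2 cycles   (99% hits)
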
Cache Performance Metrics

• Miss Rate
  - Fraction of memory references not found in cache (misses / accesses)
    = 1 - hit rate.
  - Typical numbers (in percentages): 3%-10% for L1; can be quite small
    (e.g., < 1%) for L2, depending on size, etc.

• Hit Time
  - Time to deliver a line in the cache to the processor.
  - Includes time to determine whether the line is in the cache.
  - Typical hit times: 1-2 clock cycles for L1; 5-20 clock cycles for L2.

• Miss Penalty
  - Additional time required because of a miss.
  - Typically 50-200 cycles for L2 (trend: increasing!).



Can we have more than one cache?

• Why would we want to do that?



Memory Hierarchies

• Some fundamental and enduring properties of hardware and software systems:
  - Faster storage technologies almost always cost more per byte and have
    lower capacity.
  - The gaps between memory technology speeds are widening.
    True for: registers ↔ cache, cache ↔ DRAM, DRAM ↔ disk, etc.
  - Well-written programs tend to exhibit good locality.

• These properties complement each other beautifully.

• They suggest an approach for organizing memory and storage systems known as a
  memory hierarchy.



An Example Memory Hierarchy

Smaller, faster, costlier per byte at the top; larger, slower, cheaper per byte at
the bottom:

  registers                  CPU registers hold words retrieved from L1 cache
  on-chip L1 cache (SRAM)    L1 cache holds cache lines retrieved from L2 cache
  off-chip L2 cache (SRAM)   L2 cache holds cache lines retrieved from main memory
  main memory (DRAM)         main memory holds disk blocks retrieved from local disks
  local secondary storage    local disks hold files retrieved from disks on
  (local disks)              remote network servers
  remote secondary storage   (distributed file systems, web servers)



An Example Memory Hierarchy

Who manages each level?

  registers:                 explicitly program-controlled
  on-chip L1 cache (SRAM),
  off-chip L2 cache (SRAM),
  main memory (DRAM):        the program just sees "memory"; hardware manages
                             caching transparently
  local secondary storage (local disks)
  remote secondary storage (distributed file systems, web servers)



Memory Hierarchies

• Fundamental idea of a memory hierarchy:
  - For each k, the faster, smaller device at level k serves as a cache for the
    larger, slower device at level k+1.
• Why do memory hierarchies work?
  - Because of locality, programs tend to access the data at level k more often
    than they access the data at level k+1.
  - Thus, the storage at level k+1 can be slower, and thus larger and cheaper
    per bit.
• Big Idea: the memory hierarchy creates a large pool of storage that costs as much
  as the cheap storage near the bottom, but that serves data to programs at the
  rate of the fast storage near the top.



Intel Core i7 Cache Hierarchy

Processor package: Core 0 … Core 3, each with its own registers, L1 i-cache and
d-cache, and unified L2 cache; a single L3 unified cache shared by all cores; main
memory below that.

  L1 i-cache and d-cache:  32 KB, 8-way, access: 4 cycles
  L2 unified cache:        256 KB, 8-way, access: 11 cycles
  L3 unified cache:        8 MB, 16-way, access: 30-40 cycles
  Block size:              64 bytes for all caches

Where should we put data in the cache?

  Memory: 16 locations, addresses 0000-1111.
  Cache:  4 slots, indices 00, 01, 10, 11.

• How can we compute this mapping?
  index = address mod cache size,
  which is the same as taking the low-order log2(cache size) bits of the address.

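A minimal sketch of this mapping in C, assuming the 16-location memory and 4-slot cache pictured above (the code is illustrative, not from the slides):

  #include <stdio.h>

  int main(void) {
      unsigned cache_slots = 4;                  /* 2^2 slots -> 2 index bits */
      for (unsigned addr = 0; addr < 16; addr++) {
          unsigned index = addr % cache_slots;   /* same as addr & 0x3        */
          printf("address %2u (0b%u%u%u%u) -> index %u\n",
                 addr, (addr >> 3) & 1, (addr >> 2) & 1,
                 (addr >> 1) & 1, addr & 1, index);
      }
      return 0;
  }
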
Where should we put data in the cache?

  With 16 memory locations and only 4 cache slots, several addresses map to the
  same index (e.g., 0010, 0110, 1010, and 1110 all map to index 10): a collision.

• Hmm... the cache might get confused later! Why? And how do we solve that?

Use tags to record which location is cached

  Index   Tag   Data
   00     00    …
   01     ??    …
   10     01    …
   11     01    …

  tag = the rest of the address bits: together, the tag and the index identify
  which memory location a slot currently holds.

What's a cache block? (or cache line)

  Byte addresses 0-15 are grouped into block (line) numbers 0-7: bytes 0-1 form
  block 0, bytes 2-3 form block 1, …, bytes 14-15 form block 7. So here,
  block/line size = 2 bytes, and cache indices 0-7 now name blocks, not bytes.

  Typical block/line sizes: 32 bytes, 64 bytes.

A puzzle.

• What can you infer from this: the cache starts empty, and the access
  (addr, hit/miss) stream is (10, miss), (11, hit), (12, miss)?

  - 11 hits right after 10 misses, so 10 and 11 share a block:
    block size >= 2 bytes.
  - 12 misses even though 10 was just loaded, so 10 and 12 are in different
    blocks: block size < 8 bytes (an aligned 8-byte block would contain both).



Problems with direct mapped caches?

• Direct mapped: each memory address can be mapped to exactly one index in
  the cache.
• What happens if a program uses addresses 2, 6, 2, 6, 2, …?
  - 2 (0010) and 6 (0110) map to the same index, so each access evicts the other:
    a conflict, and every access misses.

Associativity

• What if we could store data in any place in the cache?
• That might slow down caches (more complicated hardware), so we do something
  in between.
• Each address maps to exactly one set.

  1-way: 8 sets, 1 block each    (direct mapped)
  2-way: 4 sets, 2 blocks each
  4-way: 2 sets, 4 blocks each
  8-way: 1 set,  8 blocks        (fully associative)

Now how do I know where data goes?

  m-bit address:  [ Tag: (m-k-n) bits ][ Index: k bits ][ Block Offset: n bits ]

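As a hedged sketch of this decomposition in C (the helper and its names are mine, not the slides'), for a cache with 2^k sets and 2^n-byte blocks:

  #include <stdint.h>
  #include <stdio.h>

  /* Split an address into tag / index / offset fields. */
  static void split_address(uint64_t addr, unsigned k, unsigned n) {
      uint64_t offset = addr & ((1ULL << n) - 1);
      uint64_t index  = (addr >> n) & ((1ULL << k) - 1);
      uint64_t tag    = addr >> (n + k);
      printf("addr 0x%llx -> tag 0x%llx, index %llu, offset %llu\n",
             (unsigned long long)addr, (unsigned long long)tag,
             (unsigned long long)index, (unsigned long long)offset);
  }

  int main(void) {
      split_address(0x1833, 3, 4);   /* the 0x1833 example below: k=3, n=4 */
      return 0;
  }
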


What's a cache block? (or cache line)

(Recap: byte addresses are grouped into blocks; in the running example, 16 byte
addresses form 8 blocks of 2 bytes each. Typical block/line sizes: 32 bytes,
64 bytes.)

Now how do I know where data goes?

  m-bit address:  [ Tag: (m-k-n) bits ][ Index: k bits ][ Block Offset: n bits ]

• Our example used a 2^2-block cache with 2^1 bytes per block.
  Where would 13 (1101) be stored?

  4-bit address:  [ Tag: ? bits ][ Index: ? bits ][ Block Offset: ? bits ]



Example placement in set-associative caches

• Where would data from address 0x1833 be placed?
  - Block size is 16 bytes, so the block offset is the low 4 bits.
• 0x1833 in binary is 00...0110000 011 0011.

  m-bit address:  [ Tag: (m-k-4) bits ][ Index: k bits ][ Block Offset: 4 bits ]

  1-way associativity: 8 sets, 1 block each  ->  k = 3 index bits  ->  set 011 = 3
  2-way associativity: 4 sets, 2 blocks each ->  k = 2 index bits  ->  set 11 = 3
  4-way associativity: 2 sets, 4 blocks each ->  k = 1 index bit   ->  set 1 = 1

Block replacement

• Any empty block in the correct set may be used for storing data.
• If there are no empty blocks, which one should we replace?
  - Replace something, of course, but what?
  - Obvious for direct-mapped caches; what about set-associative ones?
  - Caches typically use something close to least recently used (LRU)
    (hardware usually implements "not most recently used").

Another puzzle.

• What can you infer from this: the cache starts empty, and the access
  (addr, hit/miss) stream is (10, miss), (12, miss), (10, miss)?

  - 12 misses after 10 was loaded: 12 is not in the same block as 10.
  - 10 misses again: 12's block replaced 10's block.
  - So, for this set at least, the cache behaves as direct-mapped.

General Cache Organization (S, E, B)

  S = 2^s sets; E = 2^e lines per set (we say "E-way"); each line holds a valid
  bit, a tag, and B = 2^b bytes of data per cache line (the data block,
  bytes 0 … B-1).

  cache size: S x E x B data bytes

Cache Read

  Address of byte in memory:
  [ tag: t bits ][ set index: s bits ][ block offset: b bits ]

  1. Locate the set (using the s set-index bits).
  2. Check if any line in the set has a matching tag.
  3. Yes + line valid: hit.
  4. Locate the data starting at the block offset (data begins at this offset).

Example: Direct-Mapped Cache (E = 1)

Direct-mapped: one line per set. Assume: cache block size 8 bytes.

  Address of int:  [ t bits | set index: 0…01 | block offset: 100 ]

  1. The set index bits select one of the S = 2^s sets; each set is a single line
     [ v | tag | bytes 0 1 2 3 4 5 6 7 ].
  2. valid? + match?: yes = hit.
  3. The block offset (100 = byte 4) locates the data: the int (4 bytes) is here.

  No match: the old line is evicted and replaced.

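A runnable sketch of this lookup logic (the structure and trace are mine, not the slides'), for a tiny direct-mapped cache with 4 sets and 8-byte blocks:

  #include <stdbool.h>
  #include <stdint.h>
  #include <stdio.h>

  #define S 4                      /* 2^2 sets  -> 2 index bits  */
  #define B 8                      /* 2^3 bytes -> 3 offset bits */

  static struct { bool valid; uint64_t tag; } cache[S];

  static bool lookup(uint64_t addr) {
      uint64_t set = (addr / B) % S;          /* index bits          */
      uint64_t tag = addr / (B * S);          /* remaining high bits */
      if (cache[set].valid && cache[set].tag == tag)
          return true;                        /* hit                 */
      cache[set].valid = true;                /* miss: fetch block   */
      cache[set].tag = tag;                   /* (data load omitted) */
      return false;
  }

  int main(void) {
      uint64_t trace[] = {0, 4, 32, 0};       /* 0 and 32 conflict in set 0 */
      for (int i = 0; i < 4; i++)
          printf("addr %2llu: %s\n", (unsigned long long)trace[i],
                 lookup(trace[i]) ? "hit" : "miss");
      return 0;                               /* miss, hit, miss, miss */
  }
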


Example (for E = 1): assume sum, i, j in registers

Assume: cold (empty) cache. The address of an aligned element of a has the form
aa...ayyy yxx xx000: with 32-byte blocks (4 doubles), the low 5 bits are the block
offset and the next 3 bits are the set index (8 sets).

  int sum_array_rows(double a[16][16])
  {
      int i, j;
      double sum = 0;

      for (i = 0; i < 16; i++)
          for (j = 0; j < 16; j++)
              sum += a[i][j];
      return sum;
  }

Row-major traversal: each 32-byte block holds 4 consecutive doubles of a row
(a[0][0..3], a[0][4..7], …), so each 16-element row costs 4 misses:
4 misses per row of array, 4 * 16 = 64 misses (omitting matrix c).

  int sum_array_cols(double a[16][16])
  {
      int i, j;
      double sum = 0;

      for (j = 0; j < 16; j++)
          for (i = 0; i < 16; i++)
              sum += a[i][j];
      return sum;
  }

Column traversal: each row is 128 bytes = 4 blocks, so rows two apart map to the
same sets and evict each other before any block is reused: every access is a miss,
16 * 16 = 256 misses.



Example (for E = 1)

In this example, cache blocks are 16 bytes with 8 sets in the cache.
How many block offset bits? How many set index bits?

  float dotprod(float x[8], float y[8])
  {
      float sum = 0;
      int i;
      for (i = 0; i < 8; i++)
          sum += x[i]*y[i];
      return sum;
  }

  Address bits: ttt....t sss bbbb
    B = 16 = 2^b  ->  b = 4 offset bits
    S = 8  = 2^s  ->  s = 3 index bits

    address   0: 000....0 000 0000
    address 128: 000....1 000 0000
    address 160: 000....1 010 0000

If x and y have aligned starting addresses (e.g., &x[0] = 0, &y[0] = 128), x[i] and
y[i] map to the same sets, and the two arrays keep evicting each other's blocks
(x[0..3] loaded, then y[0..3] over it, then x again, …): every access misses.

If x and y have unaligned starting addresses (e.g., &x[0] = 0, &y[0] = 160), the
arrays land in different sets and both fit in the cache:

  x[0] x[1] x[2] x[3]    x[4] x[5] x[6] x[7]
  y[0] y[1] y[2] y[3]    y[4] y[5] y[6] y[7]



E-way Set-Associative Cache (Here: E = 2)

E = 2: two lines per set. Assume: cache block size 8 bytes.

  Address of short int:  [ t bits | set index: 0…01 | block offset: 100 ]

  1. The set index selects one set; each set holds two lines, each
     [ v | tag | bytes 0 1 2 3 4 5 6 7 ].
  2. Compare both lines: valid? + match: yes = hit.
  3. The block offset locates the data: the short int (2 bytes) is here.

  No match:
  • One line in the set is selected for eviction and replacement.
  • Replacement policies: random, least recently used (LRU), …

Example (for E = 2)

  float dotprod(float x[8], float y[8])
  {
      float sum = 0;
      int i;

      for (i = 0; i < 8; i++)
          sum += x[i]*y[i];
      return sum;
  }

If x and y have aligned starting addresses (e.g., &x[0] = 0, &y[0] = 128), we can
still fit both, because there are two lines in each set:

  x[0] x[1] x[2] x[3]    y[0] y[1] y[2] y[3]
  x[4] x[5] x[6] x[7]    y[4] y[5] y[6] y[7]



Types of Cache Misses

• Cold (compulsory) miss
  - Occurs on first access to a block.
• Conflict miss
  - Occurs when the cache is large enough, but multiple data objects all map to
    the same slot.
  - e.g., referencing blocks 0, 8, 0, 8, ... would miss every time.
  - Direct-mapped caches have more conflict misses than n-way set-associative
    caches (where n is a power of 2 and n > 1).
• Capacity miss
  - Occurs when the set of active cache blocks (the working set) is larger than
    the cache (it just won't fit).

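A hedged sketch of code that provokes conflict misses on a direct-mapped cache: two addresses exactly one cache-size apart map to the same set and evict each other. The 32 KB figure is an assumed typical L1 size, not a value from the slides; compile without optimization so the loads actually happen.

  #include <stdio.h>

  #define CACHE_SIZE (32 * 1024)

  static char buf[2 * CACHE_SIZE];

  int main(void) {
      long sum = 0;
      for (int rep = 0; rep < 1000000; rep++) {
          sum += buf[0];              /* maps to some set s           */
          sum += buf[CACHE_SIZE];     /* same set s: evicts the other */
      }
      printf("%ld\n", sum);           /* on a direct-mapped cache,    */
      return 0;                       /* every access after the first */
  }                                   /* misses                       */
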


What about writes?

• Multiple copies of data exist: L1, L2, possibly L3, main memory.
  What is the main problem with that?
• What to do on a write-hit?
  - Write-through: write immediately to memory and all caches in between.
  - Write-back: defer the write to memory until the line is evicted (replaced).
    Needs a dirty bit to indicate whether the line differs from memory.
• What to do on a write-miss?
  - Write-allocate: load the block into the cache, update the line in the cache.
    Good if more writes or reads to the location follow.
  - No-write-allocate: just write immediately to memory.
• Typical caches:
  - Write-back + write-allocate, usually (why?)
  - Write-through + no-write-allocate, occasionally

Write-back, write-allocate example

Initially the cache holds U = 0xBEEF (dirty bit 0); memory holds T = 0xCAFE and
U = 0xBEEF.

  mov 0xFACE, T    Write-miss on T: U's line is clean, so it is simply replaced;
                   T's block is loaded (write-allocate) and updated to 0xFACE in
                   the cache only. Dirty bit: 1.
  mov 0xFEED, T    Write-hit: the cache updates T to 0xFEED; memory still holds
                   0xCAFE. Dirty bit stays 1.
  mov U, %rax      Read-miss on U: T's line is dirty, so 0xFEED is written back to
                   memory first, then U's block is loaded. Final state: cache holds
                   U = 0xBEEF (dirty 0); memory holds T = 0xFEED, U = 0xBEEF.

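A minimal runnable sketch of this policy, mirroring the example above (the one-line cache, the 16-word "memory", and all names are mine, not the slides'; the initial cached copy of U is omitted, since the first access misses either way):

  #include <stdbool.h>
  #include <stdint.h>
  #include <stdio.h>

  static uint64_t mem[16];
  static struct { bool valid, dirty; unsigned tag; uint64_t data; } line;

  static void evict_for(unsigned tag) {
      if (line.valid && line.tag != tag) {
          if (line.dirty)
              mem[line.tag] = line.data;      /* write back only on eviction */
          line.valid = false;
      }
  }

  static void fill(unsigned tag) {            /* allocate the line on a miss */
      line.valid = true; line.dirty = false;
      line.tag = tag; line.data = mem[tag];
  }

  uint64_t cache_read(unsigned tag) {
      evict_for(tag);
      if (!line.valid) fill(tag);             /* read miss                   */
      return line.data;
  }

  void cache_write(unsigned tag, uint64_t v) {
      evict_for(tag);
      if (!line.valid) fill(tag);             /* write miss: write-allocate  */
      line.data = v; line.dirty = true;       /* write hit: cache only       */
  }

  int main(void) {
      enum { T = 0, U = 1 };
      mem[T] = 0xCAFE; mem[U] = 0xBEEF;
      cache_write(T, 0xFACE);                 /* mov 0xFACE, T               */
      cache_write(T, 0xFEED);                 /* mov 0xFEED, T               */
      uint64_t rax = cache_read(U);           /* mov U, %rax: evicts dirty T */
      printf("rax=0x%llX mem[T]=0x%llX\n",
             (unsigned long long)rax, (unsigned long long)mem[T]);
      return 0;                               /* rax=0xBEEF mem[T]=0xFEED    */
  }
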


Back to the Core i7 to look at ways

(Same hierarchy as before: per-core 32 KB, 8-way L1 i- and d-caches at 4 cycles;
per-core 256 KB, 8-way unified L2 at 11 cycles; an 8 MB, 16-way unified L3 shared
by all cores at 30-40 cycles; block size 64 bytes for all caches. Each level down
is slower, but more likely to hit.)

Where else is caching used?



Software Caches are More Flexible

• Examples
  - File system buffer caches, web browser caches, etc.

• Some design differences
  - Almost always fully-associative: no placement restrictions, and index
    structures like hash tables are common (for placement).
  - Often use complex replacement policies: misses are very expensive when disk
    or network is involved, so it is worth thousands of cycles to avoid them.
  - Not necessarily constrained to single "block" transfers: may fetch or
    write back in larger units, opportunistically.



Optimizations for the Memory Hierarchy

• Write code that has locality!
  - Spatial: access data contiguously.
  - Temporal: make sure access to the same data is not too far apart in time.
• How can you achieve locality?
  - Proper choice of algorithm
  - Loop transformations



Example: Matrix Multiplication

  c = (double *) calloc(sizeof(double), n*n);

  /* Multiply n x n matrices a and b */
  void mmm(double *a, double *b, double *c, int n) {
      int i, j, k;
      for (i = 0; i < n; i++)
          for (j = 0; j < n; j++)
              for (k = 0; k < n; k++)
                  c[i*n + j] += a[i*n + k]*b[k*n + j];
  }

c[i][j] accumulates row i of a times column j of b. Memory access pattern?

Cache Miss Analysis

• Assume:
  - Matrix elements are doubles.
  - Cache block = 64 bytes = 8 doubles.
  - Cache size C << n (much smaller than n, not left-shifted by n).

• Spatial locality: chunks of 8 items in a row share a cache line; each item in a
  column is in a different cache line.

• First iteration (first inner product):
  - n/8 misses for the row of a, plus n misses for the column of b:
    n/8 + n = 9n/8 misses (omitting matrix c).
  - Afterwards in cache (schematic): that row of a and an 8-wide band of
    b's columns.

Cache Miss Analysis

• Assume:
  - Matrix elements are doubles.
  - Cache block = 64 bytes = 8 doubles.
  - Cache size C << n (much smaller than n).

• Other iterations:
  - Again: n/8 + n = 9n/8 misses (omitting matrix c).

• Total misses:
  - 9n/8 once per element, for n^2 elements: 9n/8 * n^2 = (9/8) * n^3.

Blocked Matrix Multiplication

  c = (double *) calloc(sizeof(double), n*n);

  /* Multiply n x n matrices a and b */
  void mmm(double *a, double *b, double *c, int n) {
      int i, j, k, i1, j1, k1;
      for (i = 0; i < n; i += B)
          for (j = 0; j < n; j += B)
              for (k = 0; k < n; k += B)
                  /* B x B mini matrix multiplications */
                  for (i1 = i; i1 < i+B; i1++)
                      for (j1 = j; j1 < j+B; j1++)
                          for (k1 = k; k1 < k+B; k1++)
                              c[i1*n + j1] += a[i1*n + k1]*b[k1*n + j1];
  }

c, a, and b are processed one B x B block at a time.

Cache Miss Analysis

• Assume:
  - Cache block = 64 bytes = 8 doubles.
  - Cache size C << n (much smaller than n).
  - Three blocks fit into the cache: 3B^2 < C.

• First (block) iteration:
  - B^2 elements per block, 8 per cache line: B^2/8 misses for each block.
  - n/B blocks per row of a, n/B blocks per column of b:
    2n/B * B^2/8 = nB/4 misses (omitting matrix c).
  - Afterwards in cache (schematic): one B x B block row of a and one block
    column of b.

Cache Miss Analysis

• Assume:
  - Cache block = 64 bytes = 8 doubles.
  - Cache size C << n (much smaller than n).
  - Three blocks fit into the cache: 3B^2 < C.

• Other (block) iterations:
  - Same as the first iteration: 2n/B * B^2/8 = nB/4.

• Total misses:
  - nB/4 per block iteration, for (n/B)^2 block iterations:
    nB/4 * (n/B)^2 = n^3/(4B).



Summary

• No blocking: (9/8) * n^3 misses
• Blocking: 1/(4B) * n^3 misses
• If B = 8, the difference is 4 * 8 * 9/8 = 36x
• If B = 16, the difference is 4 * 16 * 9/8 = 72x

• Suggests the largest possible block size B, but limit 3B^2 < C!

• Reason for the dramatic difference:
  - Matrix multiplication has inherent temporal locality:
    input data is 3n^2, computation is 2n^3, so every array element is
    used O(n) times!
  - But the program has to be written properly.

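For a concrete feel of the 3B^2 < C limit, with an assumed 32 KB cache (my example, not the slides'): C = 32768 bytes holds 4096 doubles, so 3B^2 < 4096 gives B < 37; the largest power-of-two block size that fits is B = 32, for roughly a 4 * 32 * 9/8 = 144x reduction in misses over the unblocked version.
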


Cache-Friendly Code

• Programmer can optimize for cache performance
  - How data structures are organized.
  - How data are accessed: nested loop structure; blocking is a general technique.
• All systems favor "cache-friendly code"
  - Getting absolute optimum performance is very platform specific:
    cache sizes, line sizes, associativities, etc.
  - Can get most of the advantage with generic code: keep the working set
    reasonably small (temporal locality), use small strides (spatial locality),
    and focus on inner loop code.



Intel Core i7 Cache Hierarchy (review)

(Per-core 32 KB, 8-way L1 i- and d-caches, 4 cycles; per-core 256 KB, 8-way unified
L2, 11 cycles; shared 8 MB, 16-way unified L3, 30-40 cycles; block size 64 bytes
for all caches; main memory below.)

The Memory Mountain

Intel Core i7: 32 KB L1 i-cache, 32 KB L1 d-cache, 256 KB unified L2 cache,
8 MB unified L3 cache; all caches on-chip.

(Plot: read throughput in MB/s, up to about 7000, as a function of stride
(s1-s32, x8 bytes) and working set size (2 KB-64 MB); distinct plateaus correspond
to L1, L2, L3, and main memory.)
