08 Caches
[Figure: plot of execution Time versus working-set SIZE for a program running on a real computer system, showing the actual measured data.]
Problem: Processor-Memory Bottleneck
• Processor performance doubled about every 18 months; bus bandwidth evolved much slower.
[Diagram: CPU (with registers) connected over the bus directly to main memory.]
• Solution: caches
[Diagram: CPU (with registers), then a cache, then main memory.]
(Here a "cycle" is a single fixed-time machine step.)
Cache
• English definition: a hidden storage space for provisions, weapons, and/or treasures
• More generally: a small, fast storage area used to optimize data transfers between system elements with different characteristics
[Diagram: general cache concept — the cache holds blocks 8, 9, 14, and 3. Data in block b is needed and block b (e.g., block 14) is in the cache: Hit! Memory holds all blocks, laid out as
  0  1  2  3
  4  5  6  7
  8  9 10 11
 12 13 14 15]
Locality
• Temporal locality: recently referenced items are likely to be referenced again in the near future.
• Spatial locality: items with nearby addresses tend to be referenced close together in time.
Example: Locality?
int sum_array(int a[], int n)
{
    int sum = 0;
    for (int i = 0; i < n; i++)
        sum += a[i];
    return sum;
}
• Data:
  – Temporal: sum referenced in each iteration
  – Spatial: array a[] accessed in stride-1 pattern
• Instructions:
  – Temporal: cycle through the loop repeatedly
  – Spatial: reference instructions in sequence
Locality Example #1

int sum_array_rows(int a[M][N])
{
    int i, j, sum = 0;
    for (i = 0; i < M; i++)
        for (j = 0; j < N; j++)
            sum += a[i][j];
    return sum;
}

Layout in memory (row-major, shown for M = 3, N = 4):
  a[0][0] a[0][1] a[0][2] a[0][3]
  a[1][0] a[1][1] a[1][2] a[1][3]
  a[2][0] a[2][1] a[2][2] a[2][3]

Access order:
   1: a[0][0]   2: a[0][1]   3: a[0][2]   4: a[0][3]
   5: a[1][0]   6: a[1][1]   7: a[1][2]   8: a[1][3]
   9: a[2][0]  10: a[2][1]  11: a[2][2]  12: a[2][3]
stride-1
Locality Example #2

int sum_array_cols(int a[M][N])
{
    int i, j, sum = 0;
    for (j = 0; j < N; j++)
        for (i = 0; i < M; i++)
            sum += a[i][j];
    return sum;
}

Layout in memory (row-major, shown for M = 3, N = 4):
  a[0][0] a[0][1] a[0][2] a[0][3]
  a[1][0] a[1][1] a[1][2] a[1][3]
  a[2][0] a[2][1] a[2][2] a[2][3]

Access order:
   1: a[0][0]   2: a[1][0]   3: a[2][0]
   4: a[0][1]   5: a[1][1]   6: a[2][1]
   7: a[0][2]   8: a[1][2]   9: a[2][2]
  10: a[0][3]  11: a[1][3]  12: a[2][3]
stride-N
Locality Example #3

int sum_array_3d(int a[M][N][N])
{
    int i, j, k, sum = 0;
    …
}
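By analogy with Examples #1 and #2, a plausible completion (an assumed version, not necessarily the slide's own) picks the loop order that sweeps a row-major 3-D array in stride-1 order, with the last index innermost:

int sum_array_3d(int a[M][N][N])
{
    int i, j, k, sum = 0;
    /* last index varies fastest, so accesses walk memory in stride-1 order */
    for (i = 0; i < M; i++)
        for (j = 0; j < N; j++)
            for (k = 0; k < N; k++)
                sum += a[i][j][k];
    return sum;
}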
Cache Performance Metrics
• Hit Time
  – Time to deliver a line in the cache to the processor
  – Includes time to determine whether the line is in the cache
  – Typical hit times: 1-2 clock cycles for L1; 5-20 clock cycles for L2
• Miss Penalty
  – Additional time required because of a miss
  – Typically 50-200 cycles to reach main memory (trend: increasing!)
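These two metrics combine into the average memory access time; a worked example using the standard formula (the numbers below are illustrative, not from the slides):

  average access time = hit time + miss rate × miss penalty

Assuming a 1-cycle hit time and a 100-cycle miss penalty:
  97% hit rate: 1 + 0.03 × 100 = 4 cycles on average
  99% hit rate: 1 + 0.01 × 100 = 2 cycles on average

A two-percentage-point change in hit rate doubles the average access time, which is why the miss rate, not the hit rate, is the number to watch.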
Memory Hierarchies
• Some fundamental and enduring properties of hardware and software systems:
  – Faster storage technologies almost always cost more per byte and have lower capacity
  – The gaps between memory technology speeds are widening
    (true for registers ↔ cache, cache ↔ DRAM, DRAM ↔ disk, etc.)
  – Well-written programs tend to exhibit good locality
[Diagram: memory hierarchy — smaller, faster, and costlier per byte toward the top.
  on-chip L1 cache (SRAM): holds cache lines retrieved from the L2 cache
  off-chip L2 cache (SRAM): holds cache lines retrieved from main memory
The program just sees "memory"; the hardware manages the caching transparently.]
Memory Hierarchies
• Fundamental idea of a memory hierarchy:
  – For each k, the faster, smaller device at level k serves as a cache for the larger, slower device at level k+1.
• Why do memory hierarchies work?
  – Because of locality, programs tend to access the data at level k more often than they access the data at level k+1.
  – Thus, the storage at level k+1 can be slower, and thus larger and cheaper per bit.
• Big Idea: The memory hierarchy creates a large pool of storage that costs as much as the cheap storage near the bottom, but that serves data to programs at the rate of the fast storage near the top.
[Diagram: L1 d-cache and i-cache pairs (one per core, repeated), above a unified L2 cache: 256 KB, 8-way, access 11 cycles; main memory below.]
A puzzle.
• What can you infer from this: [timing figure missing]
Associativity
• What if we could store data in any place in the cache?
• That might slow down caches (more complicated hardware), so we do something in between.
• Each address maps to exactly one set.

  1-way: 8 sets, 1 block each (direct-mapped)
  2-way: 4 sets, 2 blocks each
  4-way: 2 sets, 4 blocks each
  8-way: 1 set, 8 blocks (fully associative)

[Diagram: a 4-bit address split into tag bits, set-index bits, and block-offset bits; the field widths depend on the number of sets and the block size.]
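As a concrete sketch (hypothetical parameters, not from the slides), the three fields can be extracted with shifts and masks once s set-index bits and b block-offset bits are fixed:

#include <stdint.h>
#include <stdio.h>

#define S_BITS 2   /* assumed: 4 sets, as in the 2-way configuration above */
#define B_BITS 3   /* assumed: 8-byte blocks */

static uint64_t block_offset(uint64_t addr) { return addr & ((1ULL << B_BITS) - 1); }
static uint64_t set_index(uint64_t addr)    { return (addr >> B_BITS) & ((1ULL << S_BITS) - 1); }
static uint64_t tag_of(uint64_t addr)       { return addr >> (B_BITS + S_BITS); }

int main(void) {
    uint64_t addr = 0x74;   /* example address: 0111 0100 */
    printf("tag=%llu set=%llu offset=%llu\n",
           (unsigned long long)tag_of(addr),        /* 3 */
           (unsigned long long)set_index(addr),     /* 2 */
           (unsigned long long)block_offset(addr)); /* 4 */
    return 0;
}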
Block replacement
• Any empty block in the correct set may be used for storing data.
• If there are no empty blocks, which one should we replace?
  – Obvious for direct-mapped caches (only one candidate); what about set-associative?
  – Caches typically use something close to least recently used (LRU), as sketched below
  – (hardware usually implements "not most recently used")
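A minimal software model of LRU for one E-way set (a sketch with explicit age counters; real hardware approximates this, as noted above):

#define E 4   /* assumed: 4-way set associative */

typedef struct {
    int valid;
    unsigned long tag;
    unsigned age;   /* higher = less recently used */
} Line;

/* Pick the line to replace: an empty line if any, else the oldest. */
int victim(Line set[E]) {
    int v = 0;
    for (int i = 0; i < E; i++) {
        if (!set[i].valid) return i;
        if (set[i].age > set[v].age) v = i;
    }
    return v;
}

/* After a hit on line h, make it the most recently used. */
void touch(Line set[E], int h) {
    for (int i = 0; i < E; i++) set[i].age++;
    set[h].age = 0;
}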
Another puzzle.
• What can you infer from this: [timing figure missing; the surviving label reads "direct-mapped cache"]
General Cache Organization (S, E, B)
[Diagram: the cache is an array of S = 2^s sets; each set holds E lines; each line contains a valid bit, a tag, and a data block of B = 2^b bytes (bytes 0, 1, 2, …, B-1).
Cache size: S × E × B data bytes.]

[Diagram: to locate a block, the address is split into tag | set index | block offset. The set index selects one of the S = 2^s sets; the tag identifies the line within that set; the block offset selects a byte within the B = 2^b-byte data block.]
Example: Direct-Mapped Cache (E = 1)
[Diagram: address of an int, split as t tag bits | set index 0…01 | block offset 100, against a cache whose lines each hold a valid bit, a tag, and 8 data bytes (0-7):
  1. the set-index bits (0…01) find the set;
  2. valid? + tag match?: yes = hit;
  3. the block offset (100, i.e., byte 4) selects the int within the block.]
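Gathering those steps into code (a sketch with assumed sizes, not the slides' implementation):

#include <stdbool.h>
#include <stdint.h>

#define S 4   /* sets (assumed) */
#define B 8   /* bytes per block, as in the diagram */

typedef struct {
    bool valid;
    uint64_t tag;
    uint8_t data[B];
} Line;

static Line cache[S];   /* direct-mapped: E = 1 line per set */

/* Returns true on a hit and stores the addressed byte in *out. */
bool lookup(uint64_t addr, uint8_t *out) {
    uint64_t offset = addr % B;          /* block offset */
    uint64_t set    = (addr / B) % S;    /* set index    */
    uint64_t tag    = addr / (B * S);    /* tag          */
    Line *line = &cache[set];
    if (line->valid && line->tag == tag) {   /* valid? + match? */
        *out = line->data[offset];
        return true;                          /* hit */
    }
    return false;   /* miss: the line would be refilled from memory */
}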
[Diagram: dot product under a direct-mapped cache.
If x and y have aligned starting addresses (e.g., &x[0] = 0, &y[0] = 128), x[i] and y[i] map to the same set, so the lines holding x[0..3]/y[0..3] and x[4..7]/y[4..7] keep evicting each other.
If the starting addresses are unaligned (e.g., &x[0] = 0, &y[0] = 160), the arrays land in different sets, and all four lines — x[0..3], x[4..7], y[0..3], y[4..7] — can be cached at once.]
No match:
• One line in the set is selected for eviction and replacement
• Replacement policies: random, least recently used (LRU), …
Example (for E = 2)

float dotprod(float x[8], float y[8])
{
    float sum = 0;
    int i;
    for (i = 0; i < 8; i++)
        sum += x[i] * y[i];
    return sum;
}

If x and y have aligned starting addresses (e.g., &x[0] = 0, &y[0] = 128), the cache can still fit both arrays, because there are two lines in each set:
[Diagram: x[0..3] and y[0..3] share one set; x[4..7] and y[4..7] share the other.]
[Diagram: write-back caching with a per-line dirty bit. Memory holds T: 0xCAFE and U: 0xBEEF; the cache holds copies of both blocks, one of them modified (values such as 0xFEED and 0xFACE appear, with dirty bits 1 and 0). Memory keeps the stale T: 0xCAFE until the dirty line is written back, after which it holds T: 0xFEED, U: 0xBEEF.]
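A sketch of the policy the diagram shows (a software model; the names and the one-word block are assumptions for illustration):

#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool valid;
    bool dirty;       /* cached copy differs from memory */
    uint16_t data;    /* one-word block, for illustration */
} WBLine;

/* A write hit updates only the cache and marks the line dirty. */
void write_hit(WBLine *line, uint16_t value) {
    line->data = value;    /* e.g., T becomes 0xFEED in the cache */
    line->dirty = true;    /* memory still holds the old 0xCAFE  */
}

/* On eviction, a dirty line is written back; a clean one is just dropped. */
void evict(WBLine *line, uint16_t *mem_word) {
    if (line->valid && line->dirty)
        *mem_word = line->data;   /* memory now sees 0xFEED */
    line->valid = line->dirty = false;
}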
[Diagram: per-core L1 d-cache and i-cache pairs above a unified L2 cache (256 KB, 8-way, access 11 cycles) — slower than L1, but more likely to hit — with main memory below.]
Cache Miss Analysis
[Diagram: matrix multiply c = a × b; row i of a combines with column j of b.]
• Assume:
  – Matrix elements are doubles
  – Cache block = 64 bytes = 8 doubles
  – Cache size C << n (i.e., much smaller than n, not left-shifted by n)
• First iteration (omitting matrix c):
  – Row of a: chunks of 8 items in a row share a cache line (spatial locality) → n/8 misses
  – Column of b: each item in the column sits in a different cache line → n misses
  – Total: n/8 + n = 9n/8 misses
• Afterwards in cache (schematic): one row of a plus an 8-wide strip of b.
• Total misses:
  – 9n/8 misses per iteration × n^2 iterations = (9/8) n^3
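For reference, a sketch of the triple loop this analysis assumes (the slides' own code is not shown here; row-major n-by-n doubles):

/* c = a * b for n-by-n row-major matrices */
void matmul(int n, const double *a, const double *b, double *c) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)
                /* row of a: stride-1; column of b: stride-n */
                sum += a[i*n + k] * b[k*n + j];
            c[i*n + j] = sum;
        }
}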
Blocked Matrix Multiplication
[Diagram: c = a × b computed block by block — the (i1, j1) block of c combines a row of B × B blocks of a with a column of B × B blocks of b; there are n/B blocks per dimension.]
• Other (block) iterations:
  – Same as the first iteration
  – 2n/B blocks × B^2/8 misses per block = nB/4 misses
Summary
• No blocking: (9/8) n^3 misses
• Blocking: 1/(4B) n^3 misses
• The ratio is (9/8) / (1/(4B)) = 4B × 9/8:
  – If B = 8, the difference is 4 × 8 × 9/8 = 36×
  – If B = 16, the difference is 4 × 16 × 9/8 = 72×
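A sketch of the blocked loop nest (assuming n is a multiple of the block size and c starts zeroed; not necessarily the slides' exact code):

#define BSIZE 8   /* B = 8 doubles = one 64-byte cache line */

/* c += a * b, n-by-n row-major, n a multiple of BSIZE, c pre-zeroed */
void matmul_blocked(int n, const double *a, const double *b, double *c) {
    for (int i1 = 0; i1 < n; i1 += BSIZE)
        for (int j1 = 0; j1 < n; j1 += BSIZE)
            for (int k1 = 0; k1 < n; k1 += BSIZE)
                /* the B x B blocks of a, b, and c fit in cache together */
                for (int i = i1; i < i1 + BSIZE; i++)
                    for (int j = j1; j < j1 + BSIZE; j++) {
                        double sum = c[i*n + j];
                        for (int k = k1; k < k1 + BSIZE; k++)
                            sum += a[i*n + k] * b[k*n + j];
                        c[i*n + j] = sum;
                    }
}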
Cache-Friendly Code
• The programmer can optimize for cache performance:
  – How data structures are organized
  – How data are accessed
  – Nested loop structure
  – Blocking is a general technique
• All systems favor "cache-friendly code"
  – Getting absolute optimum performance is very platform specific
    (cache sizes, line sizes, associativities, etc.)
  – You can get most of the advantage with generic code:
    – Keep the working set reasonably small (temporal locality)
    – Use small strides (spatial locality)
    – Focus on inner loop code
The Memory Mountain
[Figure: read throughput (MB/s) as a function of working-set size (labeled: 2K, 16K, 128K, 1M, 8M) and stride (s1 through s15), measured on an Intel Core i7 with a 32 KB L1 i-cache, 32 KB L1 d-cache, 256 KB unified L2 cache, and 8 MB unified L3 cache (all caches on-chip). The throughput axis runs from 1000 to 7000 MB/s; the labeled plateaus correspond to working sets served by L1, L2, L3, and main memory.]