
Memory Hierarchy

Computer Systems Organization (Spring 2016)
CSCI-UA 201, Section 2
Instructor: Joanna Klukowska

Slides adapted from
Randal E. Bryant and David R. O'Hallaron (CMU)
Mohamed Zahran (NYU)

Cache Memory Organization and Access

Example Memory Hierarchy

[Figure: the memory hierarchy pyramid. Levels toward the top are smaller, faster, and costlier (per byte); levels toward the bottom are larger, slower, and cheaper (per byte).]

L0: Registers. CPU registers hold words retrieved from the L1 cache.
L1: L1 cache (SRAM). Holds cache lines retrieved from the L2 cache.
L2: L2 cache (SRAM). Holds cache lines retrieved from the L3 cache.
L3: L3 cache (SRAM). Holds cache lines retrieved from main memory.
L4: Main memory (DRAM). Holds disk blocks retrieved from local disks.
L5: Local secondary storage (local disks). Local disks hold files retrieved from disks on remote servers.
L6: Remote secondary storage (e.g., Web servers).

General Cache Concept

[Figure: a small cache holding a few numbered blocks (e.g., 4, 10, 14) sits above a larger memory partitioned into blocks 0 through 15; block 10 is shown being copied up in a block-sized transfer unit.]

Cache: smaller, faster, more expensive memory; caches a subset of the blocks.
Memory: larger, slower, cheaper memory, viewed as partitioned into blocks.
Data is copied between the levels in block-sized transfer units.

Cache Memories

Cache memories are small, fast SRAM-based memories managed automatically in hardware.
They hold frequently accessed blocks of main memory.
The CPU looks first for data in the cache.

Typical system structure:

[Figure: the CPU chip contains the register file, the ALU, and the cache memory; the bus interface connects via the system bus to the I/O bridge, which connects via the memory bus to main memory.]

General Cache Organization (S, E, B)

S = 2^s sets
E = 2^e lines per set
B = 2^b bytes per cache block (the data)

[Figure: the cache as a grid of S sets by E lines; each line holds a valid bit, a tag, and data bytes 0 through B-1.]

Cache size: C = S x E x B data bytes
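To make the (S, E, B) parameters concrete, here is a minimal C sketch of this organization (my addition, not from the slides; the type names and the particular geometry are made up for illustration):

#include <stdint.h>

#define B 64   /* bytes per block (b = 6) */
#define E 8    /* lines per set  (e = 3) */
#define S 64   /* sets           (s = 6) */

/* One cache line: valid bit, tag, and a B-byte block of data. */
typedef struct {
    int      valid;
    uint64_t tag;
    uint8_t  data[B];
} cache_line_t;

/* A set is E lines; the cache is S sets.
   Total capacity: C = S * E * B data bytes (here 64 * 8 * 64 = 32 KB). */
typedef struct {
    cache_line_t lines[E];
} cache_set_t;

typedef struct {
    cache_set_t sets[S];
} cache_t;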

Cache Read

Address of word:  | t bits (tag) | s bits (set index) | b bits (block offset) |

1. Locate the set (using the s set index bits).
2. Check whether any line in the set has a matching tag.
3. If yes, and the line is valid: hit.
4. Locate the data, which begins at the given offset within the line's B = 2^b byte block.

[Figure: the S = 2^s sets x E = 2^e lines organization; the set index selects a set, the tag is compared against each line in that set, and the block offset selects bytes within the matching block.]
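The three address fields can be extracted with plain bit arithmetic. A minimal sketch (my addition, not from the slides), assuming 64-bit addresses and the s and b parameters defined above:

#include <stdint.h>

/* Split an address into (tag, set index, block offset),
   given s set-index bits and b block-offset bits. */
typedef struct { uint64_t tag, set, offset; } addr_fields_t;

static addr_fields_t split_address(uint64_t addr, int s, int b) {
    addr_fields_t f;
    f.offset = addr & ((1ULL << b) - 1);         /* low b bits       */
    f.set    = (addr >> b) & ((1ULL << s) - 1);  /* next s bits      */
    f.tag    = addr >> (s + b);                  /* remaining t bits */
    return f;
}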

Example: Direct Mapped Cache (E = 1)

Direct mapped: one line per set.
Assume: cache block size of 8 bytes.

Address of int:  | t bits | 001 | 100 |

[Figure: S = 2^s sets, each containing a single line (tag plus data bytes 0-7); the set index bits (001) select the set.]

Step 1: use the set index to find the set.

Example: Direct Mapped Cache (E = 1)

Direct mapped: one line per set.
Assume: cache block size of 8 bytes.

Address of int:  | t bits | 001 | 100 |

Step 2: check the valid bit and compare the tag (assume both match): hit.

[Figure: the selected line (valid bit, tag, data bytes 0-7); the block offset bits (100) point into the block.]

Example: Direct Mapped Cache (E = 1)

Direct mapped: one line per set.
Assume: cache block size of 8 bytes.

Address of int:  | t bits | 001 | 100 |

Step 2: check the valid bit and compare the tag (assume both match): hit.
Step 3: the block offset (100₂ = 4) locates the int (4 bytes) within the block, at bytes 4-7.

If the tag doesn't match: the old line is evicted and replaced.

Example: Direct-Mapped Cache Simulation

Address bits: t=1 (tag), s=2 (set index), b=1 (block offset).
M = 16 bytes (4-bit addresses), B = 2 bytes/block, S = 4 sets, E = 1 line/set.

Address trace (reads, one byte per read):
  0  [0000₂]  miss
  1  [0001₂]  hit
  7  [0111₂]  miss
  8  [1000₂]  miss
  0  [0000₂]  miss  (8 evicted the line holding M[0-1]; 0 in turn evicts M[8-9])

Final cache state:

  Set    v   Tag   Block
  Set 0  1   0     M[0-1]
  Set 1  0   ?     ?
  Set 2  0   ?     ?
  Set 3  1   0     M[6-7]
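This trace is easy to check mechanically. Below is a minimal direct-mapped simulator that reproduces it (my own sketch, not part of the slides), hard-coding the t=1, s=2, b=1 geometry above:

#include <stdio.h>

int main(void) {
    /* Direct-mapped cache: S=4 sets, E=1 line/set, B=2 bytes/block.
       Addresses are 4 bits: 1 tag bit, 2 set-index bits, 1 offset bit. */
    int valid[4] = {0};
    int tag[4]   = {0};
    unsigned trace[] = {0, 1, 7, 8, 0};

    for (int i = 0; i < 5; i++) {
        unsigned addr = trace[i];
        unsigned set  = (addr >> 1) & 0x3;   /* bits 1-2 */
        unsigned t    = (addr >> 3) & 0x1;   /* bit 3    */
        int hit = valid[set] && tag[set] == (int)t;
        printf("addr %2u: %s\n", addr, hit ? "hit" : "miss");
        if (!hit) {   /* load the block, evicting any previous line */
            valid[set] = 1;
            tag[set]   = (int)t;
        }
    }
    return 0;
}

Running it prints exactly the miss, hit, miss, miss, miss pattern from the trace above.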

E-way Set Associative Cache (Here: E = 2)

E = 2: two lines per set.
Assume: cache block size of 8 bytes.

Address of short int:  | t bits | 001 | 100 |

[Figure: sets of two lines each, every line holding a tag and data bytes 0-7; the set index bits (001) select the set.]

Step 1: use the set index to find the set.

E-way Set Associative Cache (Here: E = 2)

E = 2: two lines per set.
Assume: cache block size of 8 bytes.

Address of short int:  | t bits | 001 | 100 |

Step 2: compare both lines in the set. Valid bit set and tag matches: hit.

[Figure: the two lines of the selected set; the block offset bits (100) point into the matching line's block.]

E-way Set Associative Cache (Here: E = 2)

E = 2: two lines per set.
Assume: cache block size of 8 bytes.

Address of short int:  | t bits | 001 | 100 |

Step 2: compare both lines in the set. Valid bit set and tag matches: hit.
Step 3: the block offset (100₂ = 4) locates the short int (2 bytes) within the block, at bytes 4-5.

No match:
One line in the set is selected for eviction and replacement.
Replacement policies: random, least recently used (LRU), etc.

Example: 2-Way Set Associative Cache Simulation

Address bits: t=2 (tag), s=1 (set index), b=1 (block offset).
M = 16 bytes (4-bit addresses), B = 2 bytes/block, S = 2 sets, E = 2 lines/set.

Address trace (reads, one byte per read):
  0  [0000₂]  miss
  1  [0001₂]  hit
  7  [0111₂]  miss
  8  [1000₂]  miss
  0  [0000₂]  hit

Final cache state:

  Set    Line  v   Tag   Block
  Set 0  0     1   00    M[0-1]
         1     1   10    M[8-9]
  Set 1  0     1   01    M[6-7]
         1     0   ?     ?
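Extending the earlier simulator sketch to two lines per set only needs a per-set LRU indicator. Again a hypothetical illustration, not from the slides:

#include <stdio.h>

int main(void) {
    /* 2-way set-associative: S=2 sets, E=2 lines/set, B=2 bytes/block.
       Addresses are 4 bits: 2 tag bits, 1 set-index bit, 1 offset bit. */
    int valid[2][2] = {{0}}, tag[2][2] = {{0}};
    int lru[2] = {0};   /* index of the least recently used line per set */
    unsigned trace[] = {0, 1, 7, 8, 0};

    for (int i = 0; i < 5; i++) {
        unsigned addr = trace[i];
        unsigned set  = (addr >> 1) & 0x1;   /* bit 1    */
        unsigned t    = (addr >> 2) & 0x3;   /* bits 2-3 */
        int hit = -1;
        for (int line = 0; line < 2; line++)
            if (valid[set][line] && tag[set][line] == (int)t)
                hit = line;
        printf("addr %2u: %s\n", addr, hit >= 0 ? "hit" : "miss");
        if (hit >= 0) {
            lru[set] = 1 - hit;              /* other line is now LRU */
        } else {
            int victim = lru[set];           /* evict the LRU line */
            valid[set][victim] = 1;
            tag[set][victim]   = (int)t;
            lru[set] = 1 - victim;
        }
    }
    return 0;
}

This prints miss, hit, miss, miss, hit: with two lines per set, loading block 8-9 no longer evicts block 0-1, so the final read of address 0 hits.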

What about writes?

Multiple copies of the data exist:
L1, L2, L3, main memory, disk

What to do on a write-hit?
Write-through: write immediately to memory.
Write-back: defer the write to memory until the line is replaced.
Needs a dirty bit (does the line differ from memory or not?).

What to do on a write-miss?
Write-allocate: load the block into the cache, then update the line in the cache.
Good if more writes to the location follow.
No-write-allocate: write straight to memory; do not load into the cache.

Typical combinations:
Write-through + No-write-allocate
Write-back + Write-allocate
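To make the write-back + write-allocate combination concrete, here is a toy sketch (my addition; the one-line "cache" and all names are invented to keep the policy logic visible, not a real design):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define B 8   /* bytes per block (illustrative) */

typedef struct {
    bool     valid;
    bool     dirty;   /* does the line differ from memory? */
    uint64_t tag;
    uint8_t  data[B];
} line_t;

static uint8_t memory[256];   /* toy backing store    */
static line_t  line;          /* a one-line "cache"   */

/* Write-back + write-allocate policy for a one-byte store. */
static void cache_write(uint64_t addr, uint8_t byte) {
    uint64_t tag = addr / B, offset = addr % B;
    if (!(line.valid && line.tag == tag)) {       /* write-miss */
        if (line.valid && line.dirty)             /* write back the victim */
            memcpy(&memory[line.tag * B], line.data, B);
        memcpy(line.data, &memory[tag * B], B);   /* write-allocate: load block */
        line.tag   = tag;
        line.valid = true;
        line.dirty = false;
    }
    line.data[offset] = byte;   /* update the cache only...            */
    line.dirty = true;          /* ...memory is updated on eviction    */
}

int main(void) {
    cache_write(3, 42);   /* miss: allocates block 0, marks it dirty   */
    cache_write(4, 7);    /* hit: same block, cache-only update        */
    cache_write(9, 1);    /* miss: flushes block 0, loads block 1      */
    printf("memory[3] = %d\n", memory[3]);   /* 42, written on eviction */
    return 0;
}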

Intel Core i7 Cache Hierarchy

[Figure: the processor package contains cores 0 through 3; each core has its own registers, L1 d-cache, L1 i-cache, and L2 unified cache; all cores share the L3 unified cache, which connects to main memory.]

L1 i-cache and d-cache: 32 KB, 8-way, access: 4 cycles
L2 unified cache: 256 KB, 8-way, access: 10 cycles
L3 unified cache: 8 MB, 16-way, access: 40-75 cycles
Block size: 64 bytes for all caches.

Cache Performance Metrics

Miss Rate
Fraction of memory references not found in the cache (misses / accesses)
= 1 - hit rate
Typical numbers (in percentages):
3-10% for L1
can be quite small (e.g., < 1%) for L2, depending on size, etc.

Hit Time
Time to deliver a line in the cache to the processor
includes the time to determine whether the line is in the cache
Typical numbers:
4 clock cycles for L1
10 clock cycles for L2

Miss Penalty
Additional time required because of a miss
typically 50-200 cycles for main memory (trend: increasing!)

Let's think about those numbers

Huge difference between a hit and a miss
Could be 100x, if just L1 and main memory

Would you believe 99% hits is twice as good as 97%?
Consider:
cache hit time of 1 cycle
miss penalty of 100 cycles

Average access time = hit time + miss rate x miss penalty:
97% hits: 1 cycle + 0.03 * 100 cycles = 4 cycles
99% hits: 1 cycle + 0.01 * 100 cycles = 2 cycles

This is why miss rate is used instead of hit rate
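A quick check of those two cases (a throwaway sketch of my own, not from the slides):

#include <stdio.h>

/* Average memory access time: hit_time + miss_rate * miss_penalty. */
static double amat(double hit_time, double miss_rate, double miss_penalty) {
    return hit_time + miss_rate * miss_penalty;
}

int main(void) {
    printf("97%% hits: %.0f cycles\n", amat(1.0, 0.03, 100.0));  /* 4 */
    printf("99%% hits: %.0f cycles\n", amat(1.0, 0.01, 100.0));  /* 2 */
    return 0;
}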

Writing Cache-Friendly Code

Make the common case go fast
Focus on the inner loops of the core functions

Minimize the misses in the inner loops
Repeated references to variables are good (temporal locality) because there is a good chance that they are stored in registers.
Stride-1 reference patterns are good (spatial locality) because subsequent references to elements in the same block will hit in the cache (one cache miss followed by many cache hits). See the sketch below.
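For instance, summing a matrix row-by-row follows the stride-1 pattern, while column-by-column does not (an illustrative sketch of my own, assuming a row-major N x N array as in the slides that follow):

#define N 1024
double a[N][N];

/* Stride-1: walks memory sequentially; roughly one miss per block. */
double sum_rowwise(void) {
    double sum = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += a[i][j];
    return sum;
}

/* Stride-N: jumps N*sizeof(double) bytes per access; no spatial locality. */
double sum_colwise(void) {
    double sum = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += a[i][j];
    return sum;
}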

Today

Cache organization and operation
Performance impact of caches
  The memory mountain
  Rearranging loops to improve spatial locality
  Using blocking to improve temporal locality

The Memory Mountain

Read throughput (read bandwidth):
Number of bytes read from memory per second (MB/s)

Memory mountain: measured read throughput as a function of spatial and temporal locality.
A compact way to characterize memory system performance.

Rearranging Loops to Improve Spatial Locality

Matrix Multiplication Example

/* ijk */
for (i=0; i<n; i++) {
    for (j=0; j<n; j++) {
        sum = 0.0;               /* variable sum held in register */
        for (k=0; k<n; k++)
            sum += a[i][k] * b[k][j];
        c[i][j] = sum;
    }
}

Description:
Multiply N x N matrices
Matrix elements are doubles (8 bytes)
O(N^3) total operations
N reads per source element
N values summed per destination, but may be able to hold in register

Miss Rate Analysis for Matrix Multiply

Assume:
Block size = 32 B (big enough for four doubles)
Matrix dimension (N) is very large: approximate 1/N as 0.0
Cache is not even big enough to hold multiple rows

Analysis method:
Look at the access pattern of the inner loop.

[Figure: in each inner loop iteration, matrix A is scanned along a row, matrix B along a column, and one element of C is updated.]

Layout of C Arrays in Memory (review)

C arrays are allocated in row-major order:
each row in contiguous memory locations

Stepping through columns in one row:
for (i = 0; i < N; i++)
    sum += a[0][i];
accesses successive elements
if block size B > sizeof(a_ij) bytes, exploit spatial locality
miss rate = sizeof(a_ij) / B

Stepping through rows in one column:
for (i = 0; i < N; i++)
    sum += a[i][0];
accesses distant elements
no spatial locality!
miss rate = 1 (i.e., 100%)

Matrix Multiplication (ijk)

/* ijk */
for (i=0; i<n; i++) {
    for (j=0; j<n; j++) {
        sum = 0.0;
        for (k=0; k<n; k++)
            sum += a[i][k] * b[k][j];
        c[i][j] = sum;
    }
}

Inner loop accesses: A row-wise (i,*); B column-wise (*,j); C fixed (i,j).

Misses per inner loop iteration:
  A     B     C
  0.25  1.0   0.0

Matrix Multiplication (jik)

/* jik */
for (j=0; j<n; j++) {
    for (i=0; i<n; i++) {
        sum = 0.0;
        for (k=0; k<n; k++)
            sum += a[i][k] * b[k][j];
        c[i][j] = sum;
    }
}

Inner loop accesses: A row-wise (i,*); B column-wise (*,j); C fixed (i,j).

Misses per inner loop iteration:
  A     B     C
  0.25  1.0   0.0

Matrix Multiplication (kij)

/* kij */
for (k=0; k<n; k++) {
    for (i=0; i<n; i++) {
        r = a[i][k];
        for (j=0; j<n; j++)
            c[i][j] += r * b[k][j];
    }
}

Inner loop accesses: A fixed (i,k); B row-wise (k,*); C row-wise (i,*).

Misses per inner loop iteration:
  A     B     C
  0.0   0.25  0.25

Matrix Multiplication (ikj)

/* ikj */
for (i=0; i<n; i++) {
    for (k=0; k<n; k++) {
        r = a[i][k];
        for (j=0; j<n; j++)
            c[i][j] += r * b[k][j];
    }
}

Inner loop accesses: A fixed (i,k); B row-wise (k,*); C row-wise (i,*).

Misses per inner loop iteration:
  A     B     C
  0.0   0.25  0.25

Matrix Multiplication (jki)

/* jki */
for (j=0; j<n; j++) {
    for (k=0; k<n; k++) {
        r = b[k][j];
        for (i=0; i<n; i++)
            c[i][j] += a[i][k] * r;
    }
}

Inner loop accesses: A column-wise (*,k); B fixed (k,j); C column-wise (*,j).

Misses per inner loop iteration:
  A     B     C
  1.0   0.0   1.0

Matrix Multiplication (kji)

/* kji */
for (k=0; k<n; k++) {
    for (j=0; j<n; j++) {
        r = b[k][j];
        for (i=0; i<n; i++)
            c[i][j] += a[i][k] * r;
    }
}

Inner loop accesses: A column-wise (*,k); B fixed (k,j); C column-wise (*,j).

Misses per inner loop iteration:
  A     B     C
  1.0   0.0   1.0

Summary of Matrix Multiplication

ijk (& jik): 2 loads, 0 stores; misses/iter = 1.25

for (i=0; i<n; i++) {
    for (j=0; j<n; j++) {
        sum = 0.0;
        for (k=0; k<n; k++)
            sum += a[i][k] * b[k][j];
        c[i][j] = sum;
    }
}

kij (& ikj): 2 loads, 1 store; misses/iter = 0.5

for (k=0; k<n; k++) {
    for (i=0; i<n; i++) {
        r = a[i][k];
        for (j=0; j<n; j++)
            c[i][j] += r * b[k][j];
    }
}

jki (& kji): 2 loads, 1 store; misses/iter = 2.0

for (j=0; j<n; j++) {
    for (k=0; k<n; k++) {
        r = b[k][j];
        for (i=0; i<n; i++)
            c[i][j] += a[i][k] * r;
    }
}
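As a self-contained version of the best-performing loop order (kij), hedged as my own illustration rather than the slides' code:

#include <stddef.h>

/* kij loop order: 0.5 misses/iter in the analysis above.
   Assumes row-major n x n matrices a, b, c, with c zero-initialized. */
void matmul_kij(size_t n, const double *a, const double *b, double *c) {
    for (size_t k = 0; k < n; k++)
        for (size_t i = 0; i < n; i++) {
            double r = a[i*n + k];            /* fixed for the inner loop */
            for (size_t j = 0; j < n; j++)
                c[i*n + j] += r * b[k*n + j]; /* stride-1 over b and c    */
        }
}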

Core i7 Matrix Multiply Performance

[Figure: cycles per inner loop iteration vs. array size, one curve per loop-order pair; jki/kji is slowest, ijk/jik is in the middle, and kij/ikj is fastest, matching the misses/iter analysis above.]

Cache Summary

Cache memories can have a significant performance impact.

You can write your programs to exploit this!
Focus on the inner loops, where the bulk of computations and memory accesses occur.
Try to maximize spatial locality by reading data objects sequentially with stride 1.
Try to maximize temporal locality by using a data object as often as possible once it's read from memory.
