Recitation05 Cachelab

The document outlines a recitation on cache concepts and blocking for a computer science course, detailing logistics, learning objectives, and activities. It includes information on a Cache Lab assignment, covering trace files, cache organization, and matrix multiplication with blocking. Additionally, it provides examples and practice problems related to cache reads and matrix operations.

15-213 Recitation

Caches and Blocking

Nishi, Sabrina, Pallavi, Abi


Monday, October 12th, 2020
Agenda
■ Logistics
■ Ca$h Lab
■ Cache Concepts
■ Activity 1: Traces
■ Activity 2: Blocking
■ Practice Problems
■ Appendix: Examples, Style, Git, fscanf
Learning Objectives
By the end of this recitation, we want you to know:
■ Cache concepts
■ Basic cache organization
■ Read and write trace files
■ Blocking concepts
■ Matrix multiplication with blocking
Logistics

■ Cache Lab is due Tuesday, Oct. 20th at 11:59pm


■ Drop date is TODAY!!
Cache Lab: Overview
■ Part 0: Write trace files for testing
■ Short and quick to familiarize yourself with the trace files
■ Extremely helpful for debugging later on!
■ Part 1: Write a cache simulator
■ Substantial amount of C code!
■ Part 2: Optimize some code to minimize cache misses
■ Substantial amount of thinking!
■ Part 3: Style Grades
■ Worth about a letter grade on this assignment
■ Few examples in appendix
■ Full guide on course website
■ Git matters!
Cache Lab: Cache Simulator Hints
■ Goal: Count hits, misses, evictions and # of dirty bytes
■ Procedure
■ Least Recently Used (LRU) replacement policy
■ Structs are good for storing cache line parts (valid bit, tag, LRU counter, etc.)
■ A cache is like a 2D array of cache lines
struct cache_line cache[S][E];
■ Your simulator needs to handle different values of S, E, and b (block size) given at
run time
■ Dynamically allocate memory!
■ Dirty bytes: any payload byte whose corresponding cache block’s dirty bit is set
(i.e. the payload of that block has been modified, but not yet written back to main
memory)
Carnegie Mellon

Cache Concepts
Cache Organization

S = 2^s sets
E = 2^e lines/set (a "line" holds one "block")
B = 2^b bytes per block

Each line: | V | D | Tag | byte 0 | byte 1 | byte 2 | ... | byte B-1 |

Cache Read

■ Address of word: | t bits | s bits | b bits |


■ Tag: t bits
■ Set index: s bits
■ Block offset: b bits
■ Steps:
■ Use the set index to select the appropriate set
■ Loop through the lines in that set looking for a matching tag
■ If a matching line is found and its valid bit is set: hit — locate the data starting at the block offset
■ Otherwise: miss — fetch the block from memory (evicting a line if the set is full)
Tying it all together: Bomblab
For the L1 dCache (data):

C = 32768 (32 KiB)
E = 8
B = 64
S = 64

How did we get S? S = C / (E × B) = 32768 / (8 × 64) = 64.

Tying it all together: Bomblab

● 64-bit address space: m = 64
● b = log2(B) = log2(64) = 6
● s = log2(S) = log2(64) = 6
● t = m − s − b = 64 − 6 − 6 = 52
Tying it all together: Bomblab
0x00604420 → 0b00000000011000000100010000100000

● tag bits: 00000000011000000100
● set index bits: 010000
● block offset bits: 100000
Activity 1: Traces
Tracing a Cache

Example Cache: -s 1 -E 2 -b 2 (S = 2 sets, E = 2 lines/set, B = 4 bytes/block)

           Line E=0                   Line E=1
Set 0:  | b=00 b=01 b=10 b=11 |  | b=00 b=01 b=10 b=11 |
Set 1:  | b=00 b=01 b=10 b=11 |  | b=00 b=01 b=10 b=11 |
Example Trace

Trace format: <operation> <memory location>,<size>
  L = Load
  S = Store

Jack.trace:
L 0,4
S 0,4
L 0,1
L 6,1
L 5,1
L 6,1
L 7,1
Example Trace

Memory:

  addr:   0x00 0x01 0x02 0x03 0x04 0x05 0x06 0x07
  value:    15   21   13   18   51   30   ac   b3

Stepping through Jack.trace on the example cache (-s 1 -E 2 -b 2):

L 0,4  Miss — cold cache. The whole block 0x00–0x03 (15 21 13 18) is loaded into set 0; that is where those values come from.
S 0,4  Hit — the block is already cached. If the store changes the values, the cached copy is updated and the block's dirty bit is set.
L 0,1  Hit — still a hit: a 1-byte load falls within the cached block. Even if we had not previously loaded all four bytes, it would still hit, because a miss always brings in the entire block.
L 6,1  Miss — just one byte? NO! The whole block 0x04–0x07 (51 30 ac b3) is loaded. It lands in set 1, not in set 0's second line, because the set-index bit of address 0x04 is 1.
L 5,1  Hit — same block as 0x06.
L 6,1  Hit
L 7,1  Hit
Example Trace

Jack2.trace picks up where Jack.trace left off; memory now also holds:

  addr:   0x08 0x09 0x0a 0x0b
  value:    de   ad   be   ef

L 8,4  Miss — the block 0x08–0x0b (de ad be ef) is loaded.

What happens when we load from memory address 0x08? Address 0x08 = 0b1000 has offset bits 00 and set-index bit 0, so the block maps to set 0, which already holds 0x00–0x03. Since E = 2, it simply occupies the second line of set 0; nothing is evicted.
Activity 2: Blocking
Example: Matrix Multiplication
/* multiply 4x4 matrices */
void mm(int a[4][4], int b[4][4], int c[4][4]) {
    int i, j, k;
    for (i = 0; i < 4; i++)
        for (j = 0; j < 4; j++)
            for (k = 0; k < 4; k++)
                c[i][j] += a[i][k] * b[k][j];
}

Let’s step through this to see what’s actually happening


Example: Matrix Multiplication

■ Assume a tiny cache with 4 lines of 8 bytes (2 ints)


■ S = 1, E = 4, B = 8
■ 3 b bits, no s bits, rest are tags!
■ Let’s see what happens if we don’t use blocking
c = a × l

(Key from the slide animation: grey = accessed, dark grey = currently accessing, red border = in cache. M = miss, H = hit, shown for the accesses to a and l.)

iter  i  j  k   operation                      a    l
0     0  0  0   c[0][0] += a[0][0] * l[0][0]   M    M
1     0  0  1   c[0][0] += a[0][1] * l[1][0]   H    M
2     0  0  2   c[0][0] += a[0][2] * l[2][0]   M    M
3     0  0  3   c[0][0] += a[0][3] * l[3][0]   H    M
4     0  1  0   c[0][1] += a[0][0] * l[0][1]   M    M

What is the miss rate of a? Each 8-byte line holds two consecutive ints of a row, so accesses to a alternate miss, hit — and by the time a row is needed again it has been evicted: 50%.

What is the miss rate of l? Walking down a column touches a different line on every access, and with only 4 lines in the cache each one is evicted before it can be reused: 100%.
Example: Matrix Multiplication (blocking)
/* multiply 4x4 matrices using blocks of size 2 */
void mm_blocking(int a[4][4], int b[4][4], int c[4][4]) {
    int i, j, k;
    int i_c, j_c, k_c;
    int B = 2;
    // control loops
    for (i_c = 0; i_c < 4; i_c += B)
        for (j_c = 0; j_c < 4; j_c += B)
            for (k_c = 0; k_c < 4; k_c += B)
                // block multiplications
                for (i = i_c; i < i_c + B; i++)
                    for (j = j_c; j < j_c + B; j++)
                        for (k = k_c; k < k_c + B; k++)
                            c[i][j] += a[i][k] * b[k][j];
}

Let’s step through this to see what’s actually happening


Example: Matrix Multiplication (blocking)

■ Assume a tiny cache with 4 lines of 8 bytes (2 ints)


■ S = 1, E = 4, B = 8
■ Let’s see what happens if we now use blocking
c = a × l

(Same cache and key as before, now with B = 2 blocking.)

iter  i  j  k   operation                      a    l
0     0  0  0   c[0][0] += a[0][0] * l[0][0]   M    M
1     0  0  1   c[0][0] += a[0][1] * l[1][0]   H    M
2     0  1  0   c[0][1] += a[0][0] * l[0][1]   H    H
3     0  1  1   c[0][1] += a[0][1] * l[1][1]   H    H
4     1  0  0   c[1][0] += a[1][0] * l[0][0]   M    H
5     1  0  1   c[1][0] += a[1][1] * l[1][0]   H    H
6     1  1  0   c[1][1] += a[1][0] * l[0][1]   H    H
7     1  1  1   c[1][1] += a[1][1] * l[1][1]   H    H
8     0  0  2   c[0][0] += a[0][2] * l[2][0]   M    M
9     0  0  3   c[0][0] += a[0][3] * l[3][0]   H    M
10    0  1  2   c[0][1] += a[0][2] * l[2][1]   H    H
11    0  1  3   c[0][1] += a[0][3] * l[3][1]   H    H
12    1  0  2   c[1][0] += a[1][2] * l[2][0]   M    H
13    1  0  3   c[1][0] += a[1][3] * l[3][0]   H    H
14    1  1  2   c[1][1] += a[1][2] * l[2][1]   H    H

What is the miss rate of a? 4 misses in the 16 accesses shown: 25%.

What is the miss rate of l? Also 4 misses in 16 accesses: 25% — each loaded line of l is now reused (for both columns of the block) before it is evicted.
Practice Problems
Class Question / Discussions
■ We’ll work through a series of questions
■ Write down your answer for each question
■ You can discuss with your classmates
What Type of Locality?
• The following function exhibits which type of locality? Consider only array accesses.

void who(int *arr, int size) {
    for (int i = 0; i < size-1; ++i)
        arr[i] = arr[i+1];
}

A. Spatial
B. Temporal
C. Both A and B
D. Neither A nor B

Answer: C. The accesses are sequential (spatial locality), and each element is read as arr[i+1] and then written as arr[i] on the following iteration (temporal locality).
What Type of Locality?
• The following function exhibits which type of locality? Consider only array accesses.

void coo(int *arr, int size) {
    for (int i = size-2; i >= 0; --i)
        arr[i] = arr[i+1];
}

A. Spatial
B. Temporal
C. Both A and B
D. Neither A nor B

Answer: C. Traversing backwards does not change the locality: accesses still touch adjacent addresses (spatial), and each element is still accessed twice on consecutive iterations (temporal).
Calculating Cache Parameters
• Given the following address partition, how many int values will fit in a single data block?

Address (bits 31–0):  | Tag: 18 bits | Set index: 10 bits | Block offset: 4 bits |

A. 0
B. 1
C. 2
D. 4
E. Unknown: We need more info

Answer: D. A 4-bit block offset means 2^4 = 16 bytes per block, which holds four 4-byte ints.
Direct-Mapped Cache Example
• Assuming a 32-bit address (i.e. m = 32), how many bits are used for tag (t), set index (s), and block offset (b)?

The cache: 4 sets (Set 0–3), E = 1 line per set, 8 bytes per data block.

      t    s   b
A.    1    2   3
B.   27    2   3
C.   25    4   3
D.    1    4   8
E.   20    4   8

Answer: B. b = log2(8) = 3, s = log2(4) = 2, and t = 32 − 2 − 3 = 27.
Which Set Is it?
• Which set is the address 0xFA1C located in? (Same cache: 4 sets, E = 1 line per set, 8 bytes per block.)

0xFA1C = 0b1111101000011100 → tag: 11111010000 | set index: 11 | block offset: 100
                               (27 tag bits counting leading zeros, 2 set bits, 3 offset bits)

A. 0
B. 1
C. 2
D. 3
E. More than one of the above

Answer: D. The set-index bits are 0b11 = 3.
Cache Block Range
• What range of addresses will be in the same block as address 0xFA1C? (8 bytes per data block.)

0xFA1C = 0b1111101000011100 → tag: 11111010000 | set index: 11 | block offset: 100

A. 0xFA1C
B. 0xFA1C – 0xFA23
C. 0xFA1C – 0xFA1F
D. 0xFA18 – 0xFA1F
E. It depends on …

Answer: D. Clearing the 3 offset bits gives the start of the block, 0xFA18; the 8-byte block spans 0xFA18–0xFA1F.
Cache Misses
If N = 16, how many bytes of a does the loop access?

int foo(int* a, int N)
{
    int i;
    int sum = 0;
    for (i = 0; i < N; i++)
    {
        sum += a[i];
    }
    return sum;
}

Accessed Bytes
A   4
B   16
C   64
D   256

Answer: C. 16 ints × 4 bytes each = 64 bytes.
Cache Misses
Consider a 32 KB cache in a 32-bit address space. The cache is 8-way associative and has 64 bytes per block. An LRU (Least Recently Used) replacement policy is used. What is the miss rate on ‘pass 1’?

void muchAccessSoCacheWow(int *bigArr) {
    // 48 KB array of ints
    int length = (48*1024)/sizeof(int);
    int access = 0;
    // traverse array with stride 8
    // pass 1
    for (int i = 0; i < length; i += 8) {
        access = bigArr[i];
    }
    // pass 2
    for (int i = 0; i < length; i += 8) {
        access = bigArr[i];
    }
}

Miss Rate
A   0%
B   25%
C   33%
D   50%
E   66%

Answer: D. Each 64-byte block covers two stride-8 accesses: the first misses, the second hits. (See the appendix for the full explanation.)
Cache Misses
Same cache and code: what is the miss rate on ‘pass 2’?

void muchAccessSoCacheWow(int *bigArr) {
    // 48 KB array of ints
    int length = (48*1024)/sizeof(int);
    int access = 0;
    // traverse array with stride 8
    // pass 1
    for (int i = 0; i < length; i += 8) {
        access = bigArr[i];
    }
    // pass 2
    for (int i = 0; i < length; i += 8) {
        access = bigArr[i];
    }
}

Miss Rate
A   0%
B   25%
C   33%
D   50%
E   66%

Answer: D. By the end of pass 1, LRU eviction has replaced exactly the blocks that pass 2 needs next, so pass 2 also misses on every other access.
Detailed explanation in Appendix!
Appendix: C Programming Style
• Properly document your code
• Function + File header comments, overall operation of large blocks, any tricky bits
• Write robust code – check error and failure conditions
• Write modular code
• Use interfaces for data structures, e.g. create/insert/remove/free functions for a
linked list
• No magic numbers – use #define or static const
• Formatting
• 80 characters per line (use Autolab’s highlight feature to double-check)
• Consistent braces and whitespace
• No memory or file descriptor leaks
Appendix: Git Usage
• Commit early and often!
• At minimum at every major milestone
• Commits don’t cost anything!

• Popular stylistic conventions


• Branches: short, descriptive names
• Commits: A single, logical change. Split large changes into multiple
commits.
• Messages:
• Summary: Descriptive, yet succinct
• Body: More detailed description on what you changed, why you
changed it, and what side effects it may have
Appendix: Parsing Input with fscanf
• fscanf(FILE *stream, const char *format, …)
• “scanf” but for files

• Arguments
1. A stream pointer, e.g. from fopen()
2. Format string for parsing, e.g. "%c %d,%d"
3+. Pointers to variables for parsed data
• Can be pointers to stack variables

• Return Value
• Success: # of parsed vars
• Failure: EOF
• man fscanf
Appendix: fscanf() Example
FILE *pFile;
pFile = fopen("trace.txt", "r"); // Open file for reading

// TODO: Error check sys call

char access_type;
unsigned long address;
int size;

// Line format is " S 2f,1" or " L 7d0,3"
// - 1 character, 1 hex value, 1 decimal value
while (fscanf(pFile, " %c %lx, %d", &access_type, &address, &size) > 0)
{
    // TODO: Do stuff
}

fclose(pFile); // Clean up resources


Appendix: Discussion Questions
• What did the optimal traversal orders have in common?

• How does the pattern generalize to int[8][8] A and a cache that holds 4 lines of 4 ints each?
Appendix: Blocking Example
• We have a 2D array int[4][4] A;
• Cache is fully associative and can hold two lines
• Each line can hold two int values

Consider the following:

• What is the best miss rate for traversing A once?

• What order of traversal did you use?

• What other traversal orders can achieve this miss rate?
Appendix: Cache Misses
If there is a 48KB cache with 8 bytes per block and 3 cache lines per set, how many misses if foo is called twice? N still equals 16.
NOTE: This is a contrived example, since the number of cache lines must be a power of 2. However, it still demonstrates an important point.

int foo(int* a, int N)
{
    int i;
    int sum = 0;
    for (i = 0; i < N; i++)
    {
        sum += a[i];
    }
    return sum;
}

Misses
A   0
B   8
C   12
D   14
E   16

Answer: B (assuming a is block-aligned). The 16 ints span eight 8-byte blocks, so the first call misses once per block; the 64-byte array easily fits in a 48KB cache, so the second call hits on every access.
Appendix: Very Hard Cache Problem
• We will use a direct-mapped cache with 2 sets, each of which can hold up to 4 int’s.
• How can we copy A into B, shifted over by 1 position?
• The most efficient way? (Use temp!)

A      0 1 2 3 4 5 6 7

B      0 1 2 3 4 5 6 7

temp   0 1 2 3 4 5 6 7

Number of misses:        🡨 Could’ve been 16 misses otherwise!

We would save even more if the block size were larger, or if temp were already cached.
Appendix: 48KB Cache Explained (1)
We access the int array in strides of 8 (note the comment and the i += 8). Each block is 64 bytes, which is enough to hold 16
ints, so in each block:

| 8 ints = 32B | 8 ints = 32B |


+---------------+---------------+
|m| | | | | | | |h| | | | | | | |
+---------------+---------------+
| 16 ints = 64B

The "m" denotes a miss, and the "h" denotes a hit. This pattern will repeat for the entirety of the array.

We can be sure that the second access is always a hit. This is because the first access will load the entire 64-byte block into
the cache (since the entire block is always loaded if any of its elements are accessed).

So, the big question is why the first access is always a miss. To answer this, we must understand many things about the cache.

First of all, we know that s, the number of set bits, is 6, which means there are 64 sets. Since each set maps to 64 bytes (as
there are b = 6 block bits), we know that every 64 * 64 bytes = 4 kilobytes we run out of sets:

64B 64B 64B 64B


+-------+-------+--...--+--------+-------+--...
| set 0 | set 1 | | set 63 | set 0 |
+-------+-------+--...--+--------+-------+--...
| 64 * 64B = 4KB |

Clearly, this pattern will repeat for the entirety of the array.
Appendix: 48KB Cache Explained (2)
However, note that we have E = 8 lines per set. That means that even though the next 4KB map to the same sets (0-63) as the first
4KB, they will just be put in another line in the cache, until we run out of lines (i.e., after we've gone through 8 * 4KB = 32KB
of memory). Splitting up the bigArr into 16KB chunks:

16KB 16KB 16KB


+-----------+-----------+-----------+
| section A | section B | section C |
+-----------+-----------+-----------+
| | | | | | | | | | | | |
4KB each

We see that section A will take up 16KB = 4 * 4KB; like we said, each of those 4KB chunks will take up 1 line each, so section A
uses 4 lines per set (and uses all 64 sets).

Similarly, section B also takes up 16KB = 4 * 4KB; again, each of those 4KB chunks will take up 1 line each, so section B also
uses 4 lines per set (and uses all 64 sets).

Note that as all of this data is being loaded in, our cache is still cold (does not contain any data from those sections), so the
previous assumption about the first of every other access missing (the "m" above) is still true.

After we read in sections A and B, the cache looks like:


line 0 1 2 3 4 5 6 7
+-------+-------+
0 | | |
1 | | |
s . . . .
e . . A . B .
t . . . .
62| | |
63| | |
+-------+-------+
Appendix: 48KB Cache Explained (3)
However, once we reach section C, we've run out of lines! So what do we have to do? We have to start evicting lines. And of
course, the least-recently used lines are the ones used to store the data from A (lines 0-3), since we just loaded in the stuff
from B. So, first of all, these evictions are causing misses on the first of every other read, so that "m" assumption is still
true. Second, after we read in the entirety of section C, the cache looks like:

line 0 1 2 3 4 5 6 7
+-------+-------+
0 | | |
1 | | |
s . . . .
e . . C . B .
t . . . .
62| | |
63| | |
+-------+-------+

Thus, we know now that the miss rate for the first pass is 50%.
Appendix: 48KB Cache Explained (4)
If we now consider the second pass, we're starting over at the beginning of bigArr (i.e., now we're reading section A). However,
there's a problem - section A isn't in the cache anymore! So we get a bunch of evictions (the "m" assumption is still true, of
course, since these evictions must also be misses). What are we evicting? The least-recently used lines, which are now lines 4-7
(holding data from B). Thus, the cache after reading section A looks like:

line 0 1 2 3 4 5 6 7
+-------+-------+
0 | | |
1 | | |
s . . . .
e . . C . A .
t . . . .
62| | |
63| | |
+-------+-------+

Then, we access B. But it isn't in the cache either! So we evict the least-recently-used lines (in this case, the lines that were
holding section C, 0-3) (the "m" assumption still holds); afterwards, the cache looks like:

line 0 1 2 3 4 5 6 7
+-------+-------+
0 | | |
1 | | |
s . . . .
e . . B . A .
t . . . .
62| | |
63| | |
+-------+-------+
Appendix: 48KB Cache Explained (5)
And finally, we access section C. But of course, its data isn't in the cache at all, so we again evict the least-recently used
lines (in this case, section A's lines, 4-7) (again, "m" assumption holds):

line 0 1 2 3 4 5 6 7
+-------+-------+
0 | | |
1 | | |
s . . . .
e . . B . C .
t . . . .
62| | |
63| | |
+-------+-------+

And so the miss rate is 50% for the second pass as well.

Thank you to Stan Zhang for coming up with such a detailed explanation!
