
CSC 252/452: Computer Organization

Fall 2024: Lecture 16

Instructor: Yanan Guo

Department of Computer Science


University of Rochester

Announcements
• Mid-term grades have been released. Solutions are on the website.
• Talk to a TA if you have doubts. Make an appointment if you can’t make any TA office hours.
• Come to my office hours if the TAs cannot resolve your questions.

Cache Illustrations

[Figure: CPU at the top, a small-but-fast cache holding the data for addresses 8, 9, 14, and 3, and a big-but-slow memory with locations 0–15.]

• Hit: the CPU requests the data at address 14 (generically, address b). Address 14 is in the cache: Hit! The data is returned directly from the cache.
• Miss: the CPU requests the data at address 12. Address 12 is not in the cache: Miss! The data at address 12 is fetched from memory, stored in the cache (replacing one of the existing entries), and returned to the CPU.

A Simple Cache

[Figure: a 4-line cache (indices 00–11, with Content and Valid? columns) next to a 16-location memory (addresses 0000–1111).]

• 16 memory locations
• 4 cache locations
  • Also called cache lines
• Every cache location has a valid bit, indicating whether that location contains valid data; 0 initially.
• For now, assume cache location size == memory location size == 1 B
• Cache is smaller than memory (obviously)
  • Thus, not all memory locations can be cached at the same time

Cache Placement

[Figure: the 4-line cache next to the 16-location memory.]

• Given a memory address, say 0x0001, we want to put the data there into the cache; where does the data go?

Fully-Associative Cache

[Figure: a 4-line cache where each line stores Content, a Valid? bit, and a Tag holding the full 4-bit address (addr[3:0]); memory values 0xAA, 0xBB, 0xCC, and 0xDD can be cached in any of the four lines.]

• Every memory location can be mapped to any cache line in the cache.
• Given a request to address A from the CPU, detecting a cache hit/miss requires:
  • Comparing address A with all four tags in the cache (a.k.a. associative search)
• Can we reduce the overhead:
  • of storing tags?
  • of comparison?
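
A minimal C sketch of the associative search above (a software model, not the lecture’s hardware; the struct, field names, and 4-bit address width are assumptions for illustration): a lookup must compare the requested address against the tag of every valid line.

    #include <stdbool.h>
    #include <stdint.h>

    #define NUM_LINES 4

    struct fa_line {
        bool    valid;   /* does this line hold valid data?         */
        uint8_t tag;     /* full 4-bit address addr[3:0] as the tag */
        uint8_t data;    /* 1-byte content                          */
    };

    /* Fully-associative lookup: the requested address is compared
     * with the tag of every valid line (associative search). */
    bool fa_lookup(struct fa_line cache[NUM_LINES], uint8_t addr, uint8_t *out)
    {
        for (int i = 0; i < NUM_LINES; i++) {
            if (cache[i].valid && cache[i].tag == addr) {
                *out = cache[i].data;   /* hit */
                return true;
            }
        }
        return false;                   /* miss: fetch from memory */
    }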

A Few Terminologies

• A cache line: content + valid bit + tag bits
  • Valid bit + tag bits are “overhead”
  • Content is what you really want to store
  • But we need the valid and tag bits to correctly access the cache
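
The same breakdown can be written as a small struct; a sketch in C (the field widths and names are illustrative assumptions, not an exact hardware layout):

    #include <stdbool.h>
    #include <stdint.h>

    /* One cache line: the content is what you actually want to store;
     * the valid bit and tag bits are the "overhead" needed to tell
     * whether the line holds useful data and which address it caches. */
    struct cache_line {
        bool    valid;  /* 0 initially: the line holds no useful data yet   */
        uint8_t tag;    /* identifies which memory address is cached here   */
        uint8_t data;   /* the cached content (1 B in this running example) */
    };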

Direct-Mapped Cache

[Figure: a 4-line cache indexed by addr[1:0]; each line stores a Tag holding addr[3:2]; a comparator checks the stored tag against addr[3:2] of the requested memory address to produce Hit?, which is reported back to the CPU.]

• Direct-Mapped Cache
  • One address can only be mapped to one cache line
  • Cache index CA = ADDR[1], ADDR[0]
    • Always use the lower-order address bits
• Multiple addresses can be mapped to the same location
  • E.g., 0010 and 1010
• How do we differentiate between different memory locations that are mapped to the same cache location?
  • Add a tag field for that purpose
• What should the tag field be?
  • ADDR[3] and ADDR[2] in this particular example
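
Putting the index and tag fields together, a direct-mapped lookup for this 4-line, 4-bit-address example might look like the following C sketch (the bit positions mirror the example above; the names are assumptions):

    #include <stdbool.h>
    #include <stdint.h>

    #define NUM_LINES 4

    struct dm_line {
        bool    valid;
        uint8_t tag;    /* addr[3:2] */
        uint8_t data;
    };

    /* Direct-mapped lookup: addr[1:0] selects the one possible line,
     * and the stored tag is compared against addr[3:2]. */
    bool dm_lookup(struct dm_line cache[NUM_LINES], uint8_t addr, uint8_t *out)
    {
        uint8_t index = addr & 0x3;         /* addr[1:0] */
        uint8_t tag   = (addr >> 2) & 0x3;  /* addr[3:2] */

        if (cache[index].valid && cache[index].tag == tag) {
            *out = cache[index].data;       /* hit */
            return true;
        }
        return false;                       /* miss */
    }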

Direct-Mapped Cache

• Limitation: each memory location can be mapped to only one cache location.
  • This leads to a lot of conflicts.
• How do we improve this?
  • Can each memory location have the flexibility to be mapped to different cache locations?

Set-Associative Cache

[Figure: the 4 cache lines are organized into 2 sets (Set 0 and Set 1), each with 2 ways (Way 0 and Way 1); each line has Content, Valid?, and Tag fields.]

• 2-Way Set-Associative Cache
  • 4 cache lines are organized into two sets; each set has 2 cache lines (i.e., 2 ways)
  • The lowest address bit is used as the cache index
    • Even addresses go to the first set and odd addresses go to the second set
  • Each address can be mapped to either cache line in the same set
  • The tag now stores the higher 3 bits instead of the entire address
• Given a request to an address, say 1011, from the CPU, detecting a cache hit/miss requires:
  • Using the LSB to index into the cache and find the corresponding set, in this case set 1
  • Then doing an associative search in that set, i.e., comparing the highest 3 bits 101 with both tags in set 1
  • Only two comparisons are required
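
A C sketch of the 2-way lookup just described (the bit split matches the example; the struct and function names are assumptions): the LSB picks the set, and only the two ways in that set are searched.

    #include <stdbool.h>
    #include <stdint.h>

    #define NUM_SETS 2
    #define NUM_WAYS 2

    struct sa_line {
        bool    valid;
        uint8_t tag;    /* addr[3:1], the upper 3 bits */
        uint8_t data;
    };

    /* 2-way set-associative lookup: addr[0] indexes the set, then the
     * tag addr[3:1] is compared against both ways in that set. */
    bool sa_lookup(struct sa_line cache[NUM_SETS][NUM_WAYS], uint8_t addr, uint8_t *out)
    {
        uint8_t set = addr & 0x1;          /* addr[0]   */
        uint8_t tag = (addr >> 1) & 0x7;   /* addr[3:1] */

        for (int way = 0; way < NUM_WAYS; way++) {
            if (cache[set][way].valid && cache[set][way].tag == tag) {
                *out = cache[set][way].data;   /* hit: only two comparisons needed */
                return true;
            }
        }
        return false;                          /* miss */
    }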

Direct-Mapped (1-way Associative) Cache

[Figure: the same cache viewed as four 1-way sets; lines 00, 01, and 10 hold 0xEF, 0xAC, and 0x06 with tag 10, mirroring memory locations 1000, 1001, and 1010.]

• 4 cache lines are organized into four sets
• Each memory location can only be mapped to one set
  • Using the 2 LSBs to find the set
• The tag now stores the higher 2 bits

Associative versus Direct-Mapped Trade-offs

• Direct-Mapped cache
  • Generally lower hit rate
  • Simpler, faster
• Set-Associative cache
  • Generally higher hit rate; better utilization of cache resources
  • Slower and higher power consumption. Why? (The associative search needs one tag comparator per way, plus logic to combine the per-way hit signals.)

[Figure: a direct-mapped cache indexed by addr[1:0] with a single comparator checking the stored tag against addr[3:2], versus a 2-way set-associative cache indexed by addr[0] with two comparators checking both tags against addr[3:1], whose outputs are ORed to produce Hit?]

[Figure: miss rate versus cache size on the Integer portion of SPEC CPU2000.]

Cache Organization

• Finding a name in a roster
  • If the roster is completely unorganized
    • Need to compare the name with all the names in the roster
    • Same as a fully-associative cache
  • If the roster is ordered by last name, and within the same last name different first names are unordered
    • First find the last-name group
    • Then compare the first name with all the first names in the same group
    • Same as a set-associative cache

Cache Access Summary (So far…)

• Assuming b bits in a memory address
• The b bits are split into two halves:
  • The lower s bits are used as the index to find a set. Total sets S = 2^s
  • The higher (b - s) bits are used for the tag
• Associativity n (i.e., the number of ways in a cache set) is independent of the split between index and tag

Memory address layout (bit b-1 down to bit 0):
  [      tag      |    index    ]
   b-1 .......... s  s-1 ...... 0

Locality again

• So far: temporal locality
• What about spatial locality?
• Idea: each cache location (cache line) stores multiple bytes

Cache-Line Size of 2

[Figure: a 4-line cache where each line holds 2 bytes, next to the 16-location memory; locations 1000–1011 hold the values a, b, c, and d.]

• Read 1000 (miss: the line holding a and b is brought into the cache)
• Read 1001 (Hit!)
• Read 1010 (miss: the line holding c and d is brought into the cache)
• Read 1011 (Hit!)
• How do we access the cache now?
  • Index with addr[2:1], compare the stored tag against addr[3], and use addr[0] to select which byte of the line is returned to the CPU through a MUX

Cache Access Summary

• Assuming b bits in a memory address
• The b bits are split into three fields:
  • The lower l bits are used as the byte offset within a cache line. Cache line size L = 2^l
  • The next s bits are used as the index to find a set. Total sets S = 2^s
  • The higher (b - l - s) bits are used for the tag
• Associativity n is independent of the split between index and tag

Memory address layout (bit b-1 down to bit 0):
  [      tag      |    index    |   offset   ]
   b-1 ........ l+s  l+s-1 .... l  l-1 ..... 0
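
The three-field split can be written directly as shifts and masks; a minimal C sketch (the function and parameter names are made up for illustration, with l and s as on this slide):

    #include <stdint.h>

    /* Split a b-bit address into offset (low l bits), set index (next s bits),
     * and tag (the remaining high bits), following the layout above. */
    void split_address(uint64_t addr, unsigned l, unsigned s,
                       uint64_t *offset, uint64_t *index, uint64_t *tag)
    {
        *offset = addr & ((1ULL << l) - 1);          /* addr[l-1 : 0]   */
        *index  = (addr >> l) & ((1ULL << s) - 1);   /* addr[l+s-1 : l] */
        *tag    = addr >> (l + s);                   /* addr[b-1 : l+s] */
    }

    /* Example: with a 64 B line (l = 6) and 64 sets (s = 6),
     * address 0x12345 splits into offset 0x05, index 0x0D, tag 0x12. */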

Handling Reads

• Read miss: put into cache
  • Any reason not to put it into the cache?
  • What to replace? Depends on the replacement policy. More on this later.
• Read hit: nothing special. Enjoy the hit!

Handling Writes (Hit)

• Intricacy: the data value is modified!
  • Implication: the value in the cache will be different from that in memory!
• When do we write the modified data in a cache to the next level?
  • Write-through: at the time the write happens
  • Write-back: when the cache line is evicted
• Write-back
  • + Can consolidate multiple writes to the same block before eviction. Potentially saves bandwidth between cache and memory, and saves energy.
  • - Needs a bit in the tag store indicating the block is “dirty/modified”
• Write-through
  • + Simpler
  • + Memory is up to date
  • - More bandwidth intensive; no coalescing of writes
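
A hedged C sketch of the two write-hit policies (the dirty bit, the write_to_memory helper, and the struct layout are illustrative assumptions, not a real memory interface):

    #include <stdbool.h>
    #include <stdint.h>

    struct line {
        bool     valid;
        bool     dirty;   /* write-back only: has the cached copy been modified? */
        uint64_t tag;
        uint8_t  data;
    };

    void write_to_memory(uint64_t addr, uint8_t value);   /* assumed to exist elsewhere */

    /* Write-through: update the cache and immediately write to the next level. */
    void write_hit_through(struct line *l, uint64_t addr, uint8_t value)
    {
        l->data = value;
        write_to_memory(addr, value);        /* memory stays up to date */
    }

    /* Write-back: only update the cache and mark the line dirty;
     * memory is updated later, when the line is evicted. */
    void write_hit_back(struct line *l, uint8_t value)
    {
        l->data  = value;
        l->dirty = true;                     /* consolidate writes until eviction */
    }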

Handling Writes (Miss)

• Do we allocate a cache line on a write miss?
  • Write-allocate: allocate on a write miss
  • Non-write-allocate: do not allocate on a write miss
• Allocate on write miss
  • + Can consolidate writes instead of writing each of them individually to memory
  • + Simpler because write misses can be treated the same way as read misses
• Non-allocate
  • + Conserves cache space if locality of writes is low (potentially better cache hit rate)
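
Continuing the sketch, the write-miss choice could be expressed as follows in C (handle_write_miss, fetch_line_from_memory, and write_hit_in_cache are hypothetical helpers used only to show the control flow):

    #include <stdbool.h>
    #include <stdint.h>

    void write_to_memory(uint64_t addr, uint8_t value);     /* assumed helpers */
    void fetch_line_from_memory(uint64_t addr);
    void write_hit_in_cache(uint64_t addr, uint8_t value);

    /* On a write miss: write-allocate brings the line into the cache first and
     * then treats the access like a write hit; non-write-allocate writes
     * straight to memory and leaves the cache alone. */
    void handle_write_miss(uint64_t addr, uint8_t value, bool write_allocate)
    {
        if (write_allocate) {
            fetch_line_from_memory(addr);    /* allocate the line (may evict another) */
            write_hit_in_cache(addr, value); /* now handled the same way as a hit     */
        } else {
            write_to_memory(addr, value);    /* conserve cache space for other data   */
        }
    }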

Instruction vs. Data Caches

• Separate or unified?
• Unified:
  • + Dynamic sharing of cache space: no overprovisioning that might happen with static partitioning (i.e., split Inst and Data caches)
  • - Instructions and data can thrash each other (i.e., no guaranteed space for either)
  • - Inst and Data are accessed in different places in the pipeline. Where do we place the unified cache for fast access?
• First-level caches are almost always split
  • Mainly for the last reason above
• Second and higher levels are almost always unified

General Rule: Bigger == Slower

[Figure: CPU connected to a cache ($), which is connected to memory.]

• How big should the cache be?
  • Too small, and there is too much memory traffic
  • Too large, and the cache slows down execution (high latency)
• Make multiple levels of cache
  • A small L1 backed up by a larger L2
  • Today’s processors typically have 3 cache levels

A Real Intel Processor

Eviction/Replacement Policy

• Which cache line should be replaced?
  • Direct mapped? Only one place!
  • Associative caches? Multiple places!
• For an associative cache:
  • Any invalid cache line first
  • If all are valid, consult the replacement policy
    • Randomly pick one???
    • Ideally: replace the cache line that’s least likely to be used again
    • Approximation: least recently used (LRU)
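
In code form, victim selection within a set might look like the following C sketch (replacement_policy_pick stands in for whatever policy is used, e.g., LRU; all names are assumptions):

    #include <stdbool.h>

    #define NUM_WAYS 4

    struct way { bool valid; /* tag, data, ... */ };

    int replacement_policy_pick(int set);    /* e.g., LRU; assumed to exist */

    /* Pick the victim way within a set: use any invalid line first,
     * and only consult the replacement policy if all ways are valid. */
    int pick_victim(struct way set_lines[NUM_WAYS], int set)
    {
        for (int w = 0; w < NUM_WAYS; w++)
            if (!set_lines[w].valid)
                return w;                    /* free slot: no eviction needed */
        return replacement_policy_pick(set); /* all valid: the policy decides */
    }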

Implementing LRU

• Idea: evict the least recently accessed block
• Challenge: need to keep track of the access ordering of blocks
• Question: for a 2-way set associative cache, what do you need to implement LRU perfectly? One bit?
  • One LRU-index bit per set is enough: it always points at the least recently used way
• Example address stream (ways 0 and 1):
  • Hit on way 0 → LRU index = 1
  • Hit on way 1 → LRU index = 0
  • Miss → evict way 0
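
A minimal C sketch of the 1-bit scheme traced above (a software model of the per-set LRU bit; the names are assumptions):

    #include <stdint.h>

    /* One LRU bit per set: it always points at the least recently used way. */
    static uint8_t lru_index;   /* 0 or 1 */

    /* On a hit to way `way`, the other way becomes the least recently used. */
    void lru2_on_access(int way)
    {
        lru_index = (uint8_t)(1 - way);
    }

    /* On a miss, evict the way the LRU bit points to; after the fill,
     * that way is the most recently used, so point the bit the other way. */
    int lru2_pick_victim(void)
    {
        int victim = lru_index;
        lru_index = (uint8_t)(1 - victim);
        return victim;
    }

    /* Tracing the stream on this slide: hit on 0 -> LRU bit = 1;
     * hit on 1 -> LRU bit = 0; miss -> evict way 0. */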

Implementing LRU

• Question: for a 4-way set associative cache, what do you need to implement LRU perfectly? Will the same mechanism work?
  • Example address stream with a 2-bit LRU index, initially pointing at way 1: hit on way 0, hit on way 2, hit on way 3, then a miss that evicts way 1. How do we update the LRU index now?
  • Essentially, we have to track the access ordering of all cache lines in the set
  • What are the hardware structures needed?
• In reality, true LRU is never implemented. Too complex.
  • “Pseudo-LRU” is usually used in real processors.
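
One common pseudo-LRU variant (not specified in the lecture, so this particular scheme is an illustrative assumption) is tree-PLRU: for 4 ways it keeps just 3 bits per set arranged as a small binary tree, approximating LRU far more cheaply than tracking the full ordering. A C sketch:

    #include <stdint.h>

    /* Tree pseudo-LRU for a 4-way set: 3 bits arranged as a binary tree.
     * Each bit points toward the side that should be victimized next.
     * b0 chooses between {way 0, way 1} and {way 2, way 3}; b1 and b2
     * choose within each pair. (The bit convention here is one common choice.) */
    struct plru4 { uint8_t b0, b1, b2; };

    /* On an access to way w, flip the bits on w's path to point away from it. */
    void plru4_on_access(struct plru4 *t, int w)
    {
        if (w < 2) {            /* ways 0 and 1 live under the left child     */
            t->b0 = 1;          /* the next victim should come from the right */
            t->b1 = (w == 0) ? 1 : 0;
        } else {                /* ways 2 and 3 live under the right child    */
            t->b0 = 0;
            t->b2 = (w == 2) ? 1 : 0;
        }
    }

    /* Follow the bits from the root to find the pseudo-LRU victim. */
    int plru4_pick_victim(const struct plru4 *t)
    {
        if (t->b0 == 0)
            return (t->b1 == 0) ? 0 : 1;
        else
            return (t->b2 == 0) ? 2 : 3;
    }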
