
CMSC 611: Advanced Computer Architecture

Cache
Introduction
• Why do designers need to know about Memory technology?
– Processor performance is usually limited by memory bandwidth
– As IC densities increase, lots of memory will fit on chip
• What are the different types of memory?
• How to maximize memory performance at the least cost?

[Figure: computer organization: Processor (Control, Datapath), Memory, and Devices (Input, Output)]
Processor-Memory Performance

[Figure: processor vs. DRAM performance, 1980-2000, log scale. Processor performance ("Moore's Law") grows at ~60%/yr (2X/1.5 yr); DRAM performance grows at ~9%/yr (2X/10 yrs); the resulting processor-memory performance gap grows ~50%/yr.]
Problem: Memory can be a bottleneck for processor performance
Solution: Rely on a memory hierarchy of faster memories to bridge the gap
Memory Hierarchy
• Temporal Locality (Locality in Time):
⇒ Keep most recently accessed data items closer to the processor

• Spatial Locality (Locality in Space):
⇒ Move blocks consisting of contiguous words to the faster levels
[Figure: memory hierarchy: processor registers and datapath, on-chip cache (SRAM), second-level cache (SRAM), main memory (DRAM), secondary storage (disk). Registers and caches are managed by the compiler and hardware; main memory and disk by the operating system. Speed: fastest to slowest. Size: smallest to biggest. Cost: highest to lowest.]
Memory Hierarchy Terminology
• Hit: data appears in some block in the faster level (example: Block X)
– Hit Rate: the fraction of memory accesses found in the faster level
– Hit Time: time to access the faster level, which consists of:
• Memory access time + Time to determine hit/miss
• Miss: data needs to be retrieved from a block in the slower level (example: Block Y)
– Miss Rate = 1 - (Hit Rate)
– Miss Penalty: Time to replace a block in the upper (faster) level + Time to deliver the block to the processor
• Hit Time << Miss Penalty

[Figure: on a hit, Block X is delivered to the processor from the faster-level memory; on a miss, Block Y is brought into the faster level from the slower-level memory]

Slide: Dave Patterson


Memory Hierarchy Design Issues
• Block identification
– How is a block found if it is in the upper (faster) level?
• Tag/Block
• Block placement
– Where can a block be placed in the upper (faster) level?
• Fully Associative, Set Associative, Direct Mapped
• Block replacement
– Which block should be replaced on a miss?
• Random, LRU
• Write strategy
– What happens on a write?
• Write Back or Write Through (with Write Buffer)

Slide: Dave Patterson


The Basics of Cache
• Cache: the level of the hierarchy closest to the processor
• Caches first appeared in research machines in the early 1960s
• Virtually every general-purpose computer produced today includes cache
• Requesting Xn generates a miss, and the word Xn is brought from main memory into the cache

[Figure: cache contents (a) before the reference to Xn, holding X1 .. Xn-1 but not Xn, and (b) after the reference, with Xn added]

• Issues:
– How do we know that a data item is in the cache?
– If so, how do we find it?
Direct-Mapped Cache
• Memory words can be mapped only to one cache block
• Worst case is to keep replacing a block followed by a miss for it: the Ping-Pong Effect
• To reduce misses:
– make the cache size bigger
– provide multiple entries for the same Cache Index

[Figure: a direct-mapped cache entry holds a Valid Bit, a Cache Tag, and Cache Data (Byte 3 .. Byte 0); memory addresses 00001, 00101, 01001, 01101, 10001, 10101, 11001, 11101 each map to the cache block given by their low-order bits]

• Cache block address = (Block address) modulo (Number of cache blocks)
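A minimal C sketch of this mapping (the 8-block geometry and the addresses come from the figure; the program itself is illustrative):

#include <stdio.h>

/* Direct-mapped placement: with a power-of-two number of cache
   blocks, (block address) mod (number of blocks) is just the
   low-order bits of the address. The 8-block cache and the 5-bit
   memory addresses (00001, 00101, ..., 11101) match the figure. */
#define NUM_BLOCKS 8

int main(void) {
    unsigned addrs[] = { 0x01, 0x05, 0x09, 0x0D, 0x11, 0x15, 0x19, 0x1D };
    for (int i = 0; i < 8; i++) {
        unsigned block = addrs[i] % NUM_BLOCKS;    /* same as addrs[i] & 0x7 */
        printf("memory address %2u -> cache block %u\n", addrs[i], block);
    }
    return 0;
}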


Accessing Cache

[Figure: a 32-bit address split into a 20-bit Tag (bits 31-12), a 10-bit Index (bits 11-2), and a 2-bit byte offset (bits 1-0); the Index selects one of 1024 (Valid, Tag, Data) entries, the stored Tag is compared with the address Tag to produce Hit, and the Data field supplies the 32-bit word]

• Cache size depends on:
– # cache blocks
– # address bits
– word size
• Example: for an n-bit address, 4-byte words, and 1024 cache blocks:
cache size = 1024 × [(n - 10 - 2) + 1 + 32] bits
(per entry: tag bits + valid bit + word size)
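A short C sketch of this sizing formula (the function name and the log2 loops are my own; the numbers instantiate the slide's 32-bit, 1024-block example):

#include <stdio.h>

/* The slide's sizing formula for a direct-mapped cache with one word
   per block: total bits = #blocks * (tag bits + valid bit + data bits),
   where tag bits = address bits - index bits - byte-offset bits. */
long cache_bits(int addr_bits, int num_blocks, int word_bytes) {
    int index_bits = 0, offset_bits = 0;
    for (int b = num_blocks; b > 1; b >>= 1) index_bits++;   /* log2(#blocks)    */
    for (int w = word_bytes; w > 1; w >>= 1) offset_bits++;  /* log2(word bytes) */
    int tag_bits = addr_bits - index_bits - offset_bits;
    return (long)num_blocks * (tag_bits + 1 + 8 * word_bytes);
}

int main(void) {
    /* The slide's example: 32-bit addresses, 1024 blocks, 4-byte words:
       1024 * [(32 - 10 - 2) + 1 + 32] = 1024 * 53 = 54272 bits */
    printf("%ld bits\n", cache_bits(32, 1024, 4));
    return 0;
}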
Cache with Multi-Word Blocks

[Figure: a 32-bit address split into a 16-bit Tag (bits 31-16), a 12-bit Index (bits 15-4), a 2-bit block offset (bits 3-2), and a 2-bit byte offset (bits 1-0); the Index selects one of 4K (V, Tag, 128-bit Data) entries, the Tag comparison produces Hit, and a 4-to-1 multiplexor picks the requested 32-bit word]

• Takes advantage of spatial locality to improve performance
• Cache block address = (Block address) modulo (Number of cache blocks)
• Block address = (Byte address) / (Bytes per block)
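A C sketch of this address decomposition, using the field widths from the figure (the example address is arbitrary):

#include <stdio.h>

/* Decomposing a 32-bit byte address for the cache in the figure:
   16-byte (4-word) blocks, 4K entries. The address value is made up. */
int main(void) {
    unsigned addr       = 0x12345678;            /* hypothetical byte address      */
    unsigned byte_off   =  addr        & 0x3;    /* bits 1:0  - byte within word   */
    unsigned block_off  = (addr >> 2)  & 0x3;    /* bits 3:2  - word within block  */
    unsigned index      = (addr >> 4)  & 0xFFF;  /* bits 15:4 - one of 4K entries  */
    unsigned tag        =  addr >> 16;           /* bits 31:16                     */
    unsigned block_addr =  addr / 16;            /* byte address / bytes per block */
    printf("tag=%#x index=%#x word=%u byte=%u block address=%#x\n",
           tag, index, block_off, byte_off, block_addr);
    return 0;
}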
Determining Block Size
• Larger block sizes take advantage of spatial locality, BUT:
– a larger block size means a larger miss penalty: it takes longer to fill the block
– if the block size is too big relative to the cache size, the miss rate goes up: too few cache blocks
• Average Access Time =
Hit Time × (1 - Miss Rate) + Miss Penalty × Miss Rate

[Figure: three sketches vs. Block Size: Miss Penalty rises with block size; Miss Rate first falls (exploits spatial locality) and then rises (fewer blocks compromises temporal locality); Average Access Time therefore reaches a minimum at an intermediate block size, beyond which increased miss penalty and miss rate dominate]
Slide: Dave Patterson
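To make the trade-off concrete, a small C evaluation of the Average Access Time formula above (the 1-cycle hit time, 40-cycle miss penalty, and the two miss rates are invented for illustration):

#include <stdio.h>

/* Evaluating the slide's formula:
   AAT = Hit Time * (1 - Miss Rate) + Miss Penalty * Miss Rate */
double avg_access_time(double hit_time, double miss_rate, double miss_penalty) {
    return hit_time * (1.0 - miss_rate) + miss_penalty * miss_rate;
}

int main(void) {
    printf("5%% miss rate: %.2f cycles\n", avg_access_time(1.0, 0.05, 40.0));
    printf("2%% miss rate: %.2f cycles\n", avg_access_time(1.0, 0.02, 40.0));
    return 0;
}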
Block Placement
• Moving from direct mapped to set associative to fully associative increases hardware complexity but improves cache utilization

[Figure: an 8-block cache under the three schemes: direct mapped (Block # 0-7, one candidate block), set associative (Set # 0-3, search the tags within one set), and fully associative (search all tags); each organization stores Data and Tag and differs in how wide the tag search is]

• Set number = (Block number) modulo (Number of sets in the cache)
• Increased flexibility of block placement reduces the probability of cache misses
Fully Associative Cache
• Forget about the Cache Index
• Compare the Cache Tags of all cache entries in parallel
• Example: with 32-byte blocks, we need N 27-bit comparators
• By definition: Conflict Misses = 0 for a fully associative cache

[Figure: a 32-bit address split into a 27-bit Cache Tag (bits 31-5) and a Byte Select (bits 4-0, e.g. 0x01); every (Cache Tag, Valid Bit, Cache Data) entry holds a 32-byte block (Byte 0 .. Byte 31), and all tags are compared at once]

Slide: Dave Patterson
N-way Set Associative Cache
• N entries for each Cache Index
• Example: two-way set associative cache
– The Cache Index selects a “set” from the cache
– The two tags in the set are compared in parallel
– Data is selected based on the result of the tag comparison

[Figure: the Cache Index selects one set, i.e. Cache Block 0 in each of the two ways (Valid, Cache Tag, Cache Data); two comparators check the Adr Tag against both stored tags, a mux (Sel1/Sel0) picks the matching way's Cache Block, and the OR of the comparators drives Hit]

Slide: Dave Patterson
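A minimal C sketch of the two-way lookup just described (the 64-set geometry, one-word blocks, and names are my own assumptions, not from the slides):

#include <stdint.h>
#include <stdbool.h>

/* Two-way set-associative lookup: the index selects a set, both
   stored tags are checked (the hardware compares them in parallel;
   C does it in a loop), and the hit signal is effectively the OR of
   the two comparisons. */
#define NUM_SETS 64

typedef struct {
    bool     valid[2];
    uint32_t tag[2];
    uint32_t data[2];          /* one word per block, for brevity */
} cache_set_t;

static cache_set_t cache[NUM_SETS];

bool lookup(uint32_t addr, uint32_t *out) {
    uint32_t index = (addr >> 2) % NUM_SETS;   /* skip the 2-bit byte offset  */
    uint32_t tag   = addr >> 8;                /* above 2 offset + 6 index bits */
    cache_set_t *set = &cache[index];
    for (int way = 0; way < 2; way++) {
        if (set->valid[way] && set->tag[way] == tag) {
            *out = set->data[way];             /* the mux selects this way */
            return true;
        }
    }
    return false;                              /* miss */
}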
Locating a Block in Associative Cache

[Figure: a 32-bit address split into a 22-bit Tag (bits 31-10) and an 8-bit Index (bits 9-2); the Index selects one of 256 sets, each with four (V, Tag, Data) entries; the four tags are compared in parallel and a 4-to-1 multiplexor selects the Data of the hitting way]

• Tag size increases with higher levels of associativity
Handling Cache Misses
• Misses on read accesses always bring blocks from main memory
• Write accesses require careful maintenance of consistency between the cache and main memory
• Two possible strategies for handling write accesses:
– Write through: the information is written to both the block in the cache and the block in the slower memory
• Read misses cannot result in writes
• No allocation of a cache block is needed
• Always combined with write buffers so that the processor doesn't wait for the slower memory
– Write back: the information is written only to the block in the cache; the modified cache block is written to main memory only when it is replaced
• Is the block clean or dirty?
• No writes to slow memory for repeated write accesses
• Requires allocation of a cache block
Write Through via Buffering

[Figure: Processor → Cache → DRAM, with a Write Buffer between the cache and DRAM]

• The processor writes data into the cache and the write buffer
• The memory controller writes the contents of the buffer to memory
• Increased write frequency can saturate the write buffer
• If the CPU cycle time is too fast and/or too many store instructions occur in a row:
– the store buffer will overflow no matter how big you make it
– this happens as the CPU cycle time gets close to the DRAM write cycle time
• Write buffer saturation can be handled by installing a second-level (L2) cache

[Figure: Processor → Cache → L2 Cache → DRAM, with the Write Buffer draining into the L2 cache]

Slide: Dave Patterson
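A sketch of the write buffer as a small FIFO between the cache and DRAM (the 4-entry depth, the names, and the hypothetical dram_write call are illustrative assumptions):

#include <stdint.h>
#include <stdbool.h>

/* The processor enqueues stores; the memory controller drains them
   into DRAM. If stores arrive faster than DRAM retires them, the
   queue fills and the processor must stall -- the saturation case
   described above. */
#define WB_DEPTH 4

typedef struct { uint32_t addr, data; } wb_entry_t;

static wb_entry_t buf[WB_DEPTH];
static int head = 0, count = 0;

bool wb_enqueue(uint32_t addr, uint32_t data) {   /* processor side */
    if (count == WB_DEPTH)
        return false;                             /* buffer full: CPU stalls */
    buf[(head + count) % WB_DEPTH] = (wb_entry_t){ addr, data };
    count++;
    return true;
}

bool wb_drain_one(void) {                         /* memory-controller side */
    if (count == 0)
        return false;
    /* dram_write(buf[head].addr, buf[head].data);   hypothetical DRAM call */
    head = (head + 1) % WB_DEPTH;
    count--;
    return true;
}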


Block Replacement Strategy
• Straightforward for Direct Mapped, since every block has only one location
• Set Associative or Fully Associative:
– Random: pick any block
– LRU (Least Recently Used)
• requires tracking block references
• for a two-way set associative cache, a reference bit is attached to every block
• more complex hardware is needed for higher levels of cache associativity

Miss rates by cache size, associativity, and replacement strategy:

Associativity:   2-way            4-way            8-way
Size             LRU     Random   LRU     Random   LRU     Random
16 KB            5.2%    5.7%     4.7%    5.3%     4.4%    5.0%
64 KB            1.9%    2.0%     1.5%    1.7%     1.4%    1.5%
256 KB           1.15%   1.17%    1.13%   1.13%    1.12%   1.12%

• Empirical results indicate that the replacement strategy becomes less significant as cache size increases

Slide: Dave Patterson
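A minimal sketch of LRU tracking for the two-way case (names and the 64-set geometry are mine; one common simplification stores a single bit per set rather than one per block):

#include <stdbool.h>

/* For a two-way set, one bit per set recording which way was
   referenced last is enough: the victim is always the other way.
   Higher associativity needs more state per set, which is the
   hardware cost the slide refers to. Random replacement, by
   contrast, needs no per-set state at all. */
#define NUM_SETS 64

static bool way1_used_last[NUM_SETS];

void on_access(int set, int way) { way1_used_last[set] = (way == 1); }

int pick_victim(int set) {         /* evict the least recently used way */
    return way1_used_last[set] ? 0 : 1;
}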


Measuring Cache Performance
• To enhance cache performance, one can:
– reduce the miss rate (e.g. by diminishing the probability of block collisions)
– reduce the miss penalty (e.g. by adding multi-level caching)
– reduce the hit time (e.g. with a simple and small cache)

CPU time = (CPU execution clock cycles + Memory-stall clock cycles) × Clock cycle time

Memory-stall clock cycles = Read-stall cycles + Write-stall cycles

Read-stall cycles = (Reads / Program) × Read miss rate × Read miss penalty

For a write-through scheme (write buffer stalls are hard to control; assume the buffer is large enough to ignore them):

Write-stall cycles = (Writes / Program) × Write miss rate × Write miss penalty + Write buffer stalls
Example
Assume an instruction cache miss rate for gcc of 2% and a data cache miss rate of 4%.
If a machine has a CPI of 2 without any memory stalls and the miss penalty is 40 cycles
for all misses, determine how much faster the machine would run with a perfect cache that
never missed. Assume a 36% combined frequency for load and store instructions.
Answer:
Assume the number of instructions = I
Instruction miss cycles = I × 2% × 40 = 0.80 × I
Data miss cycles = I × 36% × 4% × 40 = 0.56 × I
Total memory-stall cycles = 0.80 I + 0.56 I = 1.36 I
CPI with memory stalls = 2 + 1.36 = 3.36

CPU time with stalls / CPU time with perfect cache
= (I × CPIstall × Clock cycle) / (I × CPIperfect × Clock cycle)
= CPIstall / CPIperfect
= 3.36 / 2 = 1.68

What happens if the CPU gets faster?
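The slide's arithmetic, reproduced as a tiny C program. Note the exact data-miss term is 0.36 × 0.04 × 40 = 0.576 cycles per instruction; the slide rounds the 1.44% per-instruction miss frequency down, giving 0.56 and a CPI of 3.36:

#include <stdio.h>

/* 2% I-cache misses, 4% D-cache misses, 36% loads/stores,
   40-cycle miss penalty, base CPI of 2. */
int main(void) {
    double inst_stalls = 0.02 * 40;           /* 0.80 cycles/instruction  */
    double data_stalls = 0.36 * 0.04 * 40;    /* 0.576 cycles/instruction */
    double cpi_stall   = 2.0 + inst_stalls + data_stalls;
    printf("CPI with stalls = %.3f (slide: 3.36), perfect-cache speedup = %.2fx\n",
           cpi_stall, cpi_stall / 2.0);
    return 0;
}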
Multi-level Cache Performance
Suppose we have a 500 MHz processor with a base CPI of 1.0 with no cache misses.
Assume the memory access time is 200 ns and the average cache miss rate is 5%. Compare
performance after adding a second-level cache, with an access time of 20 ns, that reduces
the miss rate to main memory to 2%.
Answer:
The miss penalty to main memory = 200 ns / cycle time = 200 × 500 / 1000 = 100 clock cycles

With one level of cache:
Effective CPI = Base CPI + memory-stall cycles/instruction = 1 + 5% × 100 = 6.0

With two levels of cache:
The miss penalty for accessing the 2nd-level cache = 20 × 500 / 1000 = 10 clock cycles
Total CPI = Base CPI + main-memory stall cycles/instruction + secondary-cache stall cycles/instruction
= 1 + 2% × 100 + 5% × 10 = 3.5
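The same calculation as a small C program (all constants are the slide's own):

#include <stdio.h>

/* Two-level cache example: 500 MHz clock (2 ns cycle), 200 ns main
   memory, 20 ns L2, 5% L1 miss rate, 2% miss rate to main memory,
   base CPI of 1.0. */
int main(void) {
    double cycle_ns    = 1000.0 / 500.0;                   /* 2 ns       */
    double mem_penalty = 200.0 / cycle_ns;                 /* 100 cycles */
    double l2_penalty  =  20.0 / cycle_ns;                 /* 10 cycles  */
    double cpi_one     = 1.0 + 0.05 * mem_penalty;                      /* 6.0 */
    double cpi_two     = 1.0 + 0.02 * mem_penalty + 0.05 * l2_penalty;  /* 3.5 */
    printf("one level: CPI = %.1f\ntwo levels: CPI = %.1f\nspeedup = %.2fx\n",
           cpi_one, cpi_two, cpi_one / cpi_two);
    return 0;
}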
