Chapter 5.1-5.6 Memory

Memory Hierarchy and Caches

Cache Memories in the Datapath
[Figure: pipelined datapath with an I-Cache in the fetch stage (addressed by the PC and delivering the instruction) and a D-Cache in the memory stage (Address, Data_in, Data_out), alongside the register file, immediate extension, and ALU. On an I-Cache or D-Cache miss, the block address is sent over the interface to the L2 cache or main memory and the missing instruction or data block is transferred in; an I-Cache miss or D-Cache miss causes the pipeline to stall.]
Four Basic Questions on Caches
• Q1: Where can a block be placed in a cache?
– Block placement
– Direct Mapped, Set Associative, Fully Associative
• Q2: How is a block found in a cache?
– Block identification
– Block address, tag, index
• Q3: Which block should be replaced on a cache miss?
– Block replacement
– FIFO, Random, LRU
• Q4: What happens on a write?
– Write strategy
– Write Back or Write Through cache (with Write Buffer)
Inside a Cache Memory
[Figure: the processor exchanges addresses and data with the cache, which in turn exchanges addresses and data with main memory. The cache holds N cache blocks, each paired with an address tag (Address Tag 0 ... Address Tag N – 1); the tags identify which blocks are currently in the cache.]
• Cache Block (or Cache Line)
– Unit of data transfer between main memory and a cache
– Large block size → less tag overhead + burst transfer from DRAM
– Typically, cache block size = 64 bytes in recent caches
Block Placement: Direct Mapped
• Block: unit of data transfer between cache and memory
• Direct Mapped Cache:
– A block can be placed in exactly one location in the cache

[Figure: a direct-mapped cache with 8 blocks (indices 000–111) and a main memory of 32 blocks (addresses 00000–11111). In this example, the cache index is the least significant 3 bits of the memory address, so each memory block maps to exactly one cache block.]
Direct-Mapped Cache
• A memory address is divided into
– Block address: identifies the block in memory
– Block offset: used to access bytes within a block
• The block address is further divided into
– Index: used for direct cache access
– Tag: the most-significant bits of the block address
• Index = Block Address mod Number of Cache Blocks
• The tag must also be stored inside the cache
– Needed for block identification
• A valid bit is also required to indicate
– Whether a cache block is valid or not

[Figure: the address is split into Tag | Index | offset; the index selects a (V, Tag, Block Data) entry, the stored tag is compared against the address tag, and a match with a set valid bit signals a Hit and delivers the Data.]


Direct-Mapped Cache – cont’d
• Cache hit: the block is stored inside the cache
– The index is used to access the cache block
– The address tag is compared against the stored tag
– If they are equal and the cache block is valid, then it is a hit
– Otherwise: cache miss
• If the number of cache blocks is 2^n
– n bits are used for the cache index
• If the number of bytes in a block is 2^b
– b bits are used for the block offset
• If 32 bits are used for an address
– 32 – n – b bits are used for the tag
• Cache data size = 2^(n+b) bytes


Mapping an Address to a Cache Block
• Example
– Consider a direct-mapped cache with 256 blocks
– Block size = 16 bytes
– Compute tag, index, and byte offset of address: 0x01FFF8AC
• Solution
– The 32-bit address is divided into fields: Tag (20 bits) | Index (8 bits) | offset (4 bits)
• 4-bit byte offset field, because block size = 2^4 = 16 bytes
• 8-bit cache index, because there are 2^8 = 256 blocks in the cache
• 20-bit tag field
– Byte offset = 0xC = 12 (least significant 4 bits of the address)
– Cache index = 0x8A = 138 (next lower 8 bits of the address)
– Tag = 0x01FFF (upper 20 bits of the address)
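To make the field arithmetic concrete, here is a minimal Python sketch (the function name split_address is ours, not from the slides) that decomposes an address under the parameters above:

```python
def split_address(addr, block_size=16, num_blocks=256):
    # Field widths follow the slide: b offset bits, n index bits
    b = block_size.bit_length() - 1       # 16 bytes   -> b = 4
    n = num_blocks.bit_length() - 1       # 256 blocks -> n = 8
    offset = addr & ((1 << b) - 1)        # least significant b bits
    index = (addr >> b) & ((1 << n) - 1)  # next n bits
    tag = addr >> (b + n)                 # remaining upper bits
    return tag, index, offset

print([hex(f) for f in split_address(0x01FFF8AC)])
# ['0x1fff', '0x8a', '0xc'] -- the tag, index, and offset computed above
```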

Example on Cache Placement & Misses
• Consider a small direct-mapped cache with 32 blocks
– Cache is initially empty, Block size = 16 bytes
– The following memory addresses (in decimal) are referenced:
1000, 1004, 1008, 2548, 2552, 2556.
– Map addresses to cache blocks and indicate whether hit or miss
• Solution: the address is divided into Tag (23 bits) | Index (5 bits) | offset (4 bits)
– 1000 = 0x3E8 → cache index = 0x1E → Miss (first access)
– 1004 = 0x3EC → cache index = 0x1E → Hit
– 1008 = 0x3F0 → cache index = 0x1F → Miss (first access)
– 2548 = 0x9F4 → cache index = 0x1F → Miss (different tag)
– 2552 = 0x9F8 → cache index = 0x1F → Hit
– 2556 = 0x9FC → cache index = 0x1F → Hit
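The same index/tag arithmetic can be checked mechanically. The following sketch (simulate_direct_mapped is an illustrative name) replays the reference sequence above and reproduces the hit/miss outcomes:

```python
def simulate_direct_mapped(addresses, num_blocks=32, block_size=16):
    b = block_size.bit_length() - 1   # 4 offset bits
    n = num_blocks.bit_length() - 1   # 5 index bits
    cache = {}                        # index -> stored tag (valid entries only)
    for addr in addresses:
        index = (addr >> b) & ((1 << n) - 1)
        tag = addr >> (b + n)
        outcome = "Hit" if cache.get(index) == tag else "Miss"
        cache[index] = tag            # allocate/replace the block on a miss
        print(f"{addr}: index = 0x{index:02X} -> {outcome}")

simulate_direct_mapped([1000, 1004, 1008, 2548, 2552, 2556])
```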
Fully Associative Cache
• A block can be placed anywhere in the cache → no indexing
• If m blocks exist, then
– m comparators are needed to match the tag
– Cache data size = m × 2^b bytes

[Figure: the address is split into Tag | offset only; all m (V, Tag, Block Data) entries are compared in parallel, and a multiplexer selects the Data from the matching way; any match signals a Hit.]
Set-Associative Cache
• A set is a group of blocks that can be indexed
• A block is first mapped onto a set
– Set index = Block Address mod Number of Sets in the cache
• If there are m blocks in a set (m-way set associative), then
– m tags are checked in parallel using m comparators
• If 2^n sets exist, then the set index consists of n bits
• Cache data size = m × 2^(n+b) bytes (with 2^b bytes per block)
– Without counting tags and valid bits
• A direct-mapped cache has one block per set (m = 1)
• A fully-associative cache has one set (2^n = 1, i.e. n = 0)
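A small sketch of the geometry arithmetic on this slide, assuming a 32-bit address (the function name and the example parameters are ours, chosen for illustration):

```python
def cache_geometry(data_capacity, block_size, m, addr_bits=32):
    num_blocks = data_capacity // block_size
    num_sets = num_blocks // m           # 2^n sets
    n = num_sets.bit_length() - 1        # set-index bits
    b = block_size.bit_length() - 1      # block-offset bits
    return {"sets": num_sets, "index_bits": n,
            "offset_bits": b, "tag_bits": addr_bits - n - b}

# Example: 32 KB of data, 64-byte blocks, 4-way set-associative
print(cache_geometry(32 * 1024, 64, 4))
# {'sets': 128, 'index_bits': 7, 'offset_bits': 6, 'tag_bits': 19}
```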
Set-Associative Cache Diagram
[Figure: the address is split into Tag | Index | offset; the index selects one set, the m (V, Tag, Block Data) entries of that set are compared in parallel, and a multiplexer selects the Data from the matching way; a match signals a Hit.]
Write Policy
• Write Through:
– Writes update cache and lower-level memory
– Cache control bit: only a Valid bit is needed
– Memory always has latest data, which simplifies data coherency
– Can always discard cached data when a block is replaced
• Write Back:
– Writes update cache only
– Cache control bits: Valid and Modified bits are required
– Modified cached data is written back to memory when replaced
– Multiple writes to a cache block require only one write to memory
– Uses less memory bandwidth than write-through and less power
– However, more complex to implement than write through
Write Miss Policy
• What happens on a write miss?
• Write Allocate:
– Allocate new block in cache
– Write miss acts like a read miss, block is fetched and updated
• No Write Allocate:
– Send data to lower-level memory
– Cache is not modified
• Typically, write-back caches use write-allocate
– Hoping that subsequent writes will be captured in the cache
• Write-through caches often use no-write-allocate
– Reasoning: writes must still go to lower-level memory
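A minimal sketch contrasting the two write-miss policies (the class, its fields, and the whole-block write are illustrative simplifications, not from the slides):

```python
class WriteCache:
    def __init__(self, write_allocate):
        self.write_allocate = write_allocate
        self.blocks = {}                    # block address -> cached data

    def write(self, block_addr, data, memory):
        if block_addr in self.blocks:
            self.blocks[block_addr] = data  # write hit: update the cache
        elif self.write_allocate:
            # Write allocate: the miss behaves like a read miss --
            # fetch the block into the cache, then update it
            self.blocks[block_addr] = memory.get(block_addr)
            self.blocks[block_addr] = data
        else:
            memory[block_addr] = data       # no-write-allocate: bypass cache
```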

Write Buffer
• Decouples the CPU write from the memory bus writing
– Permits writes to occur without stall cycles until buffer is full

• Write-through: all stores are sent to lower level memory


– Write buffer eliminates processor stalls on consecutive writes

• Write-back: modified blocks are written when replaced


– Write buffer is used for evicted blocks that must be written back

• The address and modified data are written in the buffer


– The write is finished from the CPU perspective
– CPU continues while the write buffer prepares to write memory

• If buffer is full, CPU stalls until buffer has an empty entry


What Happens on a Cache Miss?
• Cache sends a miss signal to stall the processor
• Decide which cache block to allocate/replace
– Only one choice when the cache is direct-mapped
– Multiple choices for set-associative or fully-associative cache
• If block to be replaced is modified then write it back
– Modified block is moved into a Write Buffer
– Otherwise, block to be replaced can be simply discarded
• Transfer the block from lower level memory to this cache
– Set the valid bit and the tag field from the upper address bits
• Restart the instruction that caused the cache miss
• Miss Penalty: clock cycles to process a cache miss
Replacement Policy
• Which block should be replaced on a cache miss?
• No selection alternatives for direct-mapped caches
• m blocks per set to choose from for associative caches
• Random replacement
– Candidate blocks are randomly selected
– One counter for all sets (0 to m – 1): incremented on every cycle
– On a cache miss replace block specified by counter
• First In First Out (FIFO) replacement
– Replace oldest block in set
– One counter per set (0 to m – 1): specifies oldest block to replace
– Counter is incremented on a cache miss

Replacement Policy – cont’d
• Least Recently Used (LRU)
– Replace block that has been unused for the longest time
– Order blocks within a set from least to most recently used
– Update ordering of blocks on each cache hit
– With m blocks per set, there are m! possible permutations

• Pure LRU is too costly to implement when m > 2


– m = 2, there are 2 permutations only (a single bit is needed)
– m = 4, there are 4! = 24 possible permutations
– LRU approximation is used in practice

• For large m (> 4), random replacement can be as effective as LRU
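As a reference point for the cost discussion above, here is a sketch of exact LRU bookkeeping for a single set (LRUSet is an illustrative name; real hardware uses the cheaper approximations mentioned above):

```python
from collections import OrderedDict

class LRUSet:
    def __init__(self, m):
        self.m = m                           # associativity: blocks per set
        self.blocks = OrderedDict()          # tags ordered least -> most recent

    def access(self, tag):
        if tag in self.blocks:
            self.blocks.move_to_end(tag)     # hit: mark most recently used
            return "hit"
        if len(self.blocks) >= self.m:
            self.blocks.popitem(last=False)  # set full: evict the LRU block
        self.blocks[tag] = None              # allocate the new block
        return "miss"

s = LRUSet(m=2)
print([s.access(t) for t in ["A", "B", "A", "C", "B"]])
# ['miss', 'miss', 'hit', 'miss', 'miss'] -- C evicts B (the LRU block),
# so the final access to B misses again
```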
Hit Rate and Miss Rate
• Hit Rate = Hits / (Hits + Misses)
• Miss Rate = Misses / (Hits + Misses)
• I-Cache Miss Rate = Miss rate in the Instruction Cache
• D-Cache Miss Rate = Miss rate in the Data Cache
• Example:
– Out of 1000 instructions fetched, 150 missed in the I-Cache
– 25% are load-store instructions, 50 missed in the D-Cache
– What are the I-cache and D-cache miss rates?

• I-Cache Miss Rate = 150 / 1000 = 15%


• D-Cache Miss Rate = 50 / (25% × 1000) = 50 / 250 = 20%
Memory Stall Cycles
• The processor stalls on a Cache miss
– When fetching instructions from the Instruction Cache (I-cache)
– When loading or storing data into the Data Cache (D-cache)

Memory stall cycles = Combined Misses × Miss Penalty


• Miss Penalty: clock cycles to process a cache miss
Combined Misses = I-Cache Misses + D-Cache Misses
I-Cache Misses = I-Count × I-Cache Miss Rate
D-Cache Misses = LS-Count × D-Cache Miss Rate
LS-Count (Load & Store) = I-Count × LS Frequency
• Cache misses are often reported per thousand instructions
Memory Stall Cycles Per Instruction
• Memory Stall Cycles Per Instruction =

Combined Misses Per Instruction × Miss Penalty


• Miss Penalty is assumed equal for I-cache & D-cache
• Miss Penalty is assumed equal for Load and Store
• Combined Misses Per Instruction =

I-Cache Miss Rate + LS Frequency × D-Cache Miss Rate


• Therefore, Memory Stall Cycles Per Instruction =

I-Cache Miss Rate × Miss Penalty +


LS Frequency × D-Cache Miss Rate × Miss Penalty
Example on Memory Stall Cycles
• Consider a program with the given characteristics
– Instruction count (I-Count) = 10^6 instructions
– 30% of instructions are loads and stores
– D-cache miss rate is 5% and I-cache miss rate is 1%
– Miss penalty is 100 clock cycles for instruction and data caches
– Compute combined misses per instruction and memory stall cycles
• Combined misses per instruction in I-Cache and D-Cache
– 1% + 30% × 5% = 0.025 combined misses per instruction
– Equal to 25 misses per 1000 instructions
• Memory stall cycles
– 0.025 × 100 (miss penalty) = 2.5 stall cycles per instruction
– Total memory stall cycles = 10^6 × 2.5 = 2,500,000
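The arithmetic above fits in a few lines; this is a sketch using the slide's values (the function name memory_stalls is ours):

```python
def memory_stalls(i_count, ls_freq, i_miss_rate, d_miss_rate, penalty):
    misses_per_instr = i_miss_rate + ls_freq * d_miss_rate  # combined misses
    stalls_per_instr = misses_per_instr * penalty
    return misses_per_instr, stalls_per_instr, i_count * stalls_per_instr

print(memory_stalls(10**6, 0.30, 0.01, 0.05, 100))
# ~ (0.025, 2.5, 2_500_000) -- up to floating-point rounding
```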
CPU Time with Memory Stall Cycles

CPU Time = I-Count × CPI_MemoryStalls × Clock Cycle

CPI_MemoryStalls = CPI_PerfectCache + Memory Stalls per Instruction

• CPI_PerfectCache = CPI for an ideal cache (no cache misses)
• CPI_MemoryStalls = CPI in the presence of memory stalls
• Memory stall cycles increase the CPI
Example on CPI with Memory Stalls
• A processor has CPI of 1.5 without any memory stalls
– Cache miss rate is 2% for instruction and 5% for data
– 20% of instructions are loads and stores
– Cache miss penalty is 100 clock cycles for I-cache and D-cache

• What is the impact on the CPI?


• Answer:
Memory Stalls per Instruction = 0.02 × 100 (instruction) + 0.2 × 0.05 × 100 (data) = 3
CPI_MemoryStalls = 1.5 + 3 = 4.5 cycles per instruction
CPI_MemoryStalls / CPI_PerfectCache = 4.5 / 1.5 = 3
The processor is 3 times slower due to memory stall cycles
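The same computation as a sketch (cpi_with_stalls is an illustrative name), reproducing the numbers above:

```python
def cpi_with_stalls(cpi_perfect, i_mr, ls_freq, d_mr, penalty):
    # Stall cycles per instruction: I-cache misses plus D-cache misses
    stalls = i_mr * penalty + ls_freq * d_mr * penalty
    return cpi_perfect + stalls

cpi = cpi_with_stalls(1.5, 0.02, 0.20, 0.05, 100)
print(cpi, cpi / 1.5)   # ~4.5 cycles per instruction, ~3x slowdown
```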
Average Memory Access Time
• Average Memory Access Time (AMAT)

AMAT = Hit time + Miss rate × Miss penalty


• Time to access a cache for both hits and misses
• Example: Find the AMAT for a cache with
– Cache access time (Hit time) of 1 cycle = 2 ns
– Miss penalty of 20 clock cycles
– Miss rate of 0.05 per access

• Solution:
AMAT = 1 + 0.05 × 20 = 2 cycles = 4 ns
Without the cache, the AMAT would equal the miss penalty = 20 cycles
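A one-line check of the AMAT arithmetic, assuming the 2 ns cycle time stated above:

```python
hit_time, miss_rate, miss_penalty = 1, 0.05, 20        # in clock cycles
amat_cycles = hit_time + miss_rate * miss_penalty      # 1 + 0.05 * 20
print(amat_cycles, "cycles =", amat_cycles * 2, "ns")  # 2.0 cycles = 4.0 ns
```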
Improving Cache Performance
• Average Memory Access Time (AMAT)

AMAT = Hit time + Miss rate × Miss penalty

• Used as a framework for optimizations


• Reduce the Hit time
– Small and simple caches

• Reduce the Miss Rate


– Larger cache size, higher associativity, and larger block size

• Reduce the Miss Penalty


– Multilevel caches