cache_concepts_memory

The document provides an overview of cache memory, including its definition, structure, and operations such as block read and write. It discusses cache mapping algorithms, specifically fully associative and direct mapping, along with their respective advantages and examples. Additionally, it covers replacement algorithms for managing cache when it is full, emphasizing the importance of locality in memory access.


System Fundamentals
Cache memories and principle of locality
[Figure: CPU (CU, ALU, Clock, Registers, Cache) connected over the I/O BUS to Memory]

Definition and Cache Concepts
Cache: Definition
cache (kash), n.
• A hiding place used especially for storing provisions.
• A place for concealment and safekeeping, as of
valuables.
• The store of goods or valuables concealed in a hiding
place.
• Computer Science. A fast storage buffer in the central
processing unit (CPU) of a computer. In this sense, also
called cache memory.
Cache Memory
Small, fast SRAM-based memory managed automatically in hardware (i.e., not by software)
• Located on the CPU
• Holds frequently accessed blocks from main memory (RAM)

[Figure: CPU (CU, ALU, Clock, Registers, Cache) connected over the BUS to the RAM controller; data moves between the cache and RAM]
General Cache Concept
Main memory is partitioned into equal-size chunks called blocks
• Not physically partitioned
• A block is a contiguous range of physical address locations

For example, RAM partitioned into N blocks, numbered 0 to N-1
Block Read Operation
Basic steps (for a block not in cache):
• CPU sends the RAM controller the start address of the block.
• RAM controller puts a copy of the block on the BUS.
• CPU controller reads the BUS and puts a copy of the block in cache.

[Figure: a copy of block 3 travels from RAM over the BUS into the cache]
Block Write Operation
Basic steps:
• CPU controller puts a copy of the block (in cache) on the BUS.
• RAM controller reads the BUS and replaces the block with the copy.

[Figure: a copy of block 3 travels from the cache over the BUS back into RAM]
Main Memory Block Partitioning

[Figure: RAM controller in front of memory divided into blocks 0 through N-1]
Memory Partition: Blocks
Simple example:
• Physical address (m) is 4 bits
• Block size = 4 bytes

Block address ranges (one block each):
• 0000 to 0011
• 0100 to 0111
• 1000 to 1011
• 1100 to 1111

[Table: 16 one-byte storage locations, addresses 0000-1111, grouped into four 4-byte blocks]
Block Offset Bits
Block address bits for the first block: 00 00, 00 01, 00 10, 00 11

The low-order bit values (shown in red on the slide) are the block offset (b) bits:
• They reference a specific byte in the block, e.g., the byte at offset 01
• Block size = 2^b bytes
• Total number of blocks = 2^m / 2^b

[Table: the same 16 addresses; in each address the low b = 2 bits are the offset and the high bits are the block address]
Cache Mapping: Block Placement Algorithms

[Figure: RAM controller and cache, with blocks 0 through N-1 mapped into cache lines]
Cache Mapping Algorithms
Three types:
• Fully associative
• Direct mapping
• Set associative

We'll focus on fully associative and direct mapping.
• Set associative is covered in Comp 530 and 630.
• Set-associative cache memories are used in modern processors.
Fully Associative (FA)
Important concepts:
1. Block data could be anywhere in the cache
2. Flexible block storage strategy
3. Expensive to evict and replace a block
• Block replacement algorithm
FA Read Example
memory address (m) bits = 4
Block size = 4 bytes
Tag bits (t) = 2 (the msb's); block offset (b) bits = 2 (the lsb's)
Valid bit: 0 = invalid
Three-line cache design

RAM contents (Address -> Data, 1 byte each):
0000-0011: 0xA1 0xA2 0xA3 0xA4
0100-0111: 0xB1 0xB2 0xB3 0xB4
1000-1011: 0xC1 0xC2 0xC3 0xC4
1100-1111: 0xD1 0xD2 0xD3 0xD4

Cache (initially empty):
line | valid | tag (t) | data at block offset (b) 00 01 10 11
  0  |   0   |         |
  1  |   0   |         |
  2  |   0   |         |
FA Read Example (Cont.)
CPU: Load data instruction, put data at address 0111 in register $8
• Search tags -> cache miss (cache is empty)
• Place a copy of the block in any open line in cache (i.e., valid = 0)
• Valid = 1
• Tag bits = 01
• Put 0xB4 in register $8

Cache (RAM as before):
line | valid | tag (t) | 00   01   10   11
  0  |   1   |   01    | 0xB1 0xB2 0xB3 0xB4
  1  |   0   |         |
  2  |   0   |         |
FA Read Example (Cont.)
CPU: Load data instruction, put data at address 0101 in register $9
• Search tags -> cache hit
• Line 0 holds tag 01 (valid = 1)
• Put 0xB2 in register $9

Cache (unchanged):
line | valid | tag (t) | 00   01   10   11
  0  |   1   |   01    | 0xB1 0xB2 0xB3 0xB4
  1  |   0   |         |
  2  |   0   |         |
FA Read Example (Cont.)
CPU: Load data instruction, put data at address 1111 in register $8
• Search tags -> cache miss
• Place a copy of the block in any open line in cache (i.e., valid = 0)
• Valid = 1
• Tag bits = 11
• Put 0xD4 in register $8

Cache:
line | valid | tag (t) | 00   01   10   11
  0  |   1   |   01    | 0xB1 0xB2 0xB3 0xB4
  1  |   1   |   11    | 0xD1 0xD2 0xD3 0xD4
  2  |   0   |         |
FA Read Example (Cont.)
CPU: Load data instruction, put data at address 1011 in register $8
• Search tags -> cache miss
• Oh snap, cache is full!!
• Must evict a valid line (i.e., invalidate) and replace it with the new block data!

Cache (all three lines valid):
line | valid | tag (t) | 00   01   10   11
  0  |   1   |   01    | 0xB1 0xB2 0xB3 0xB4
  1  |   1   |   11    | 0xD1 0xD2 0xD3 0xD4
  2  |   1   |   00    | 0xA1 0xA2 0xA3 0xA4
FA Replacement Algorithm
When the cache is full and a line must be evicted, how do we pick which line to replace?
• LRU (Least-recently used)
• replaces the line that has gone UNACCESSED the LONGEST
• favors the most recently accessed data
• FIFO/LRR (first-in, first-out/least-recently replaced)
• replaces the OLDEST line in cache
• favors recently loaded items over older STALE items
• Random
• replace some line at RANDOM
• no favoritism – uniform distribution
Direct Mapping (DM)
Important concepts:
1. Line bits determine the exact location of the block
data in cache.
2. Fairly rigid storage strategy (see 1 above)
3. Simple to evict and replace a block (see 1 above)
• No block replacement algorithm
DM Read Example
memory address (m) bits = 4
Block size = 4 bytes
Tag bits (t) = 1 (the msb); line bits (s) = 1; block offset (b) bits = 2 (the lsb's)
Valid bit: 0 = invalid
Two-line cache design

RAM contents (Address -> Data, 1 byte each):
0000-0011: 0xA1 0xA2 0xA3 0xA4
0100-0111: 0xB1 0xB2 0xB3 0xB4
1000-1011: 0xC1 0xC2 0xC3 0xC4
1100-1111: 0xD1 0xD2 0xD3 0xD4

Cache (initially empty):
line (s) | valid | tag (t) | data at block offset (b) 00 01 10 11
    0    |   0   |         |
    1    |   0   |         |
DM Read Example (Cont.)
CPU: Load data instruction, put data at address 0111 in register $8
• Go to the line and check the tag -> cache miss
• Put a copy of the block at line s = 1 in cache (the line is empty, valid = 0)
• Valid = 1
• Tag bit = 0
• Put 0xB4 in register $8

Cache (RAM as before):
line (s) | valid | tag (t) | 00   01   10   11
    0    |   0   |         |
    1    |   1   |    0    | 0xB1 0xB2 0xB3 0xB4
DM Read Example (Cont.)
CPU: Load data instruction, put data at address 0101 in register $9
• Go to the line and check the tag -> cache hit
• Line 1 holds tag 0 (valid = 1)
• Put 0xB2 in register $9

Cache (unchanged):
line (s) | valid | tag (t) | 00   01   10   11
    0    |   0   |         |
    1    |   1   |    0    | 0xB1 0xB2 0xB3 0xB4
DM Read Example (Cont.)
CPU: Load data instruction, put data at address 0000 in register $8
• Go to the line and check the tag -> cache miss
• Put a copy of the block at line s = 0 in cache (the line is empty, valid = 0)
• Valid = 1
• Tag bit = 0
• Put 0xA1 in register $8

Cache:
line (s) | valid | tag (t) | 00   01   10   11
    0    |   1   |    0    | 0xA1 0xA2 0xA3 0xA4
    1    |   1   |    0    | 0xB1 0xB2 0xB3 0xB4
DM Read Example (Cont.)
CPU: Load data instruction, put data at address 1011 in register $9
• Go to the line and check the tag -> cache miss
• Line 0 is being used (valid = 1)
• Must evict line 0 (i.e., invalidate) and replace it with the new block data!
• Put 0xC4 in register $9

Cache:
line (s) | valid | tag (t) | 00   01   10   11
    0    |   1   |    1    | 0xC1 0xC2 0xC3 0xC4
    1    |   1   |    0    | 0xB1 0xB2 0xB3 0xB4
Key Ideas
Keep data used often in a small fast SRAM
• "cache"
• accessed frequently
• on the CPU (fast!)

Keep all data in a bigger but slower DRAM
• "main memory"
• accessed rarely
• BUS transfers between CPU and RAM (slow!)
Cache R/W Operations
Read operations: very straightforward!
• Most (80+%) memory operations are reads.

Write operations: not straightforward; there are two different policies:
• WRITE-THROUGH: CPU writes are cached, but also written to main memory (stalling the CPU until the write is completed). Memory always holds the latest values.
• WRITE-BACK: CPU writes are cached, but not immediately written to main memory. Main memory contents can become "stale". Only when a value has to be evicted from the cache, and only if it has been modified (i.e., is "dirty"), is it written back to main memory.
Cache: Bytes, Shorts, and Words
In general, the size of a block in physical memory is one (or more) words!
• Never put a single short or byte from DRAM into cache
• Instead, put the entire word; much more efficient!
• This is why memory alignment is important!

Once the word is in cache, the byte or short can be accessed through hardware operations:
• i.e., bit masking and shifting
Principle of Locality: Block Organization

[Figure: memory references plotted as address vs. time cluster within the program, data, and stack regions]
Principle of Locality
Def: Programs tend to use data and instructions in memory that have addresses near or equal to those they have used recently!

Temporal locality: Recently referenced blocks are likely to be referenced again in the near future.

Spatial locality: Blocks with nearby addresses tend to be referenced close together in time. This is why blocks work!
Locality Example
sum = 0;
for (i = 0; i < n; i++)
    sum += a[i];
return sum;

Data references
• Reference array elements in succession (stride-1 reference pattern). -> Spatial locality
• Reference variable sum each iteration. -> Temporal locality

Instruction references
• Reference instructions in sequence. -> Spatial locality
• Cycle through loop repeatedly. -> Temporal locality
Summary
Programs that repeatedly reference the same variables enjoy good temporal locality.

For programs with stride-k reference patterns, the smaller the stride the better the spatial locality.
• Programs with stride-1 reference patterns have good spatial locality.
• Programs that hop around memory with large strides have poor spatial locality.

Loops have good temporal and spatial locality with respect to variables and instruction fetches.
• The smaller the loop body and the greater the number of loop iterations, the better the locality.
