Memory System: Cache Memory and Virtual Memory

 Contents

 Exploiting Memory Hierarchy


 Principle of Locality
 Basics of Cache Memory Organization
 Types of Cache Memory
 Memory Interleaving
 Levels of Cache Memory
 Virtual Memory System
Memory Technology
 Static RAM (SRAM)
 0.5ns – 2.5ns, $2000 – $5000 per GB
 Dynamic RAM (DRAM)
 50ns – 70ns, $20 – $75 per GB
 Magnetic disk
 5ms – 20ms, $0.20 – $2 per GB
 Ideal memory
 Access time of SRAM
 Capacity and cost/GB of disk
Memories: Review
 DRAM (Dynamic Random Access Memory):
 value is stored as a charge on capacitor that must be
periodically refreshed, which is why it is called dynamic
 very small – 1 transistor per bit – but factor of 5 to 10
slower than SRAM
 used for main memory
 SRAM (Static Random Access Memory):
 value is stored on a pair of inverting gates that will exist
indefinitely as long as there is power, which is why it is
called static
 very fast but takes up more space than DRAM – 6
transistors per bit
 used for cache

[Figure: memory hierarchy pyramid – the memory closest to the CPU is the fastest, smallest, and highest in cost per bit; moving down the hierarchy, memories become slower, bigger, and lower in cost per bit.]

Memory Hierarchy
 Users want large and fast memories…
 …but large, fast memory is expensive and users don’t like to pay
 Make it seem like they have what they want…
 …by building a memory hierarchy
 the hierarchy is inclusive: every level is a subset of the level below it
 performance depends on hit rates

[Figure: levels in the memory hierarchy – the processor (CPU) sits above Level 1, Level 2, …, Level n. Data are transferred between adjacent levels in blocks (the unit of data copy). Distance from the CPU, access time, and the size of the memory all increase at each lower level.]


Locality
 Locality is a principle that makes having a memory
hierarchy a good idea
 If an item is referenced, then because of
 temporal locality: it will tend to be referenced again soon
 spatial locality: nearby items will tend to be referenced soon
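A small illustration (a hypothetical Python sketch, not from the slides): summing a matrix stored row-major touches consecutive addresses in the inner loop (spatial locality) and reuses the accumulator on every iteration (temporal locality).

# Hypothetical sketch: row-major traversal shows both kinds of locality.
def sum_matrix(a, rows, cols):
    total = 0                          # 'total' is re-referenced every iteration: temporal locality
    for i in range(rows):
        for j in range(cols):
            total += a[i * cols + j]   # consecutive j -> consecutive addresses: spatial locality
    return total

print(sum_matrix(list(range(12)), 3, 4))   # 66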
Hit and Miss
 Focus on any two adjacent levels – called upper (closer to the CPU) and lower (farther from the CPU) – in the memory hierarchy, because each block copy is always between two adjacent levels
 Terminology:
 block: minimum unit of data to move between levels

 hit: data requested is in upper level

 miss: data requested is not in upper level

 hit rate: fraction of memory accesses that are hits (i.e., found at the upper level)
 miss rate: fraction of memory accesses that are not hits
   miss rate = 1 – hit rate
 hit time: time to determine whether the access is a hit + time to access and deliver the data from the upper level to the CPU
 miss penalty: time to determine that the access is a miss + time to replace the block at the upper level with the corresponding block from the lower level + time to deliver the block to the CPU
Caches
 By simple example
 assume block size = one word of data

[Figure: cache contents (a) before and (b) after a reference to word Xn – the cache holds X1…Xn–1; the reference to Xn causes a miss, so Xn is fetched from memory and added to the cache.]


 Issues:
 how do we know if a data item is in the cache?

 if it is, how do we find it?

 if not, what do we do?

 Solution depends on cache addressing scheme…


Direct Mapped
Cache
 Addressing scheme in a direct mapped cache:
 cache block address = memory block address mod cache size (unique)
 if cache size = 2^m blocks, the cache address is the lower m bits of the n-bit memory block address
 the remaining upper n – m bits are kept as tag bits at each cache block
 also need a valid bit to recognize a valid entry

[Figure: a direct mapped cache with 8 blocks (000–111); the memory blocks with addresses 00001, 00101, 01001, 01101, 10001, 10101, 11001, and 11101 all map to cache block 001, since they share the same low-order 3 bits.]
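For illustration, a minimal sketch (assumed Python; 5-bit block addresses and an 8-block cache, as in the figure) of the index/tag split for a direct mapped cache:

# Hypothetical sketch: direct-mapped index/tag arithmetic for 2**m cache blocks.
def dm_index_and_tag(block_address, m):
    index = block_address % (2 ** m)   # low-order m bits select the cache block
    tag = block_address >> m           # remaining upper bits are stored as the tag
    return index, tag

# Memory block 10110 (binary) in an 8-block (2^3) cache maps to index 110 with tag 10.
index, tag = dm_index_and_tag(0b10110, m=3)
print(bin(index), bin(tag))            # 0b110 0b10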
Accessing Cache
 Example:
(0) Initial state: all entries invalid (V = N for indices 000–111).
(1) Address 10110 referenced (miss): index 110 is loaded – V = Y, Tag = 10, Data = Mem(10110).
(2) Address 11010 referenced (miss): index 010 is loaded – V = Y, Tag = 11, Data = Mem(11010).
(3) Address 10110 referenced (hit): index 110 already holds Tag = 10, Data = Mem(10110); the data is delivered to the CPU.
(4) Address 10010 referenced (miss): index 010 is replaced – V = Y, Tag = 10, Data = Mem(10010); index 110 still holds Mem(10110).
Address division for a DM cache with 1K one-word blocks?
Direct Mapped Cache
 MIPS style:
[Figure: MIPS-style direct mapped cache with 1024 one-word blocks. The 32-bit address (bits 31–0) splits into a 2-bit byte offset (bits 1–0), a 10-bit index (bits 11–2) that selects one of the 1024 entries, and a 20-bit tag (bits 31–12) compared against the stored tag; the comparison result, ANDed with the valid bit, produces Hit, and the 32-bit data word is read out.]

 Cache with 1024 one-word blocks: the byte offset (least significant 2 bits) is ignored and the next 10 bits are used to index into the cache

What kind of locality are we taking advantage of?


[Figure: a larger direct mapped cache with 16K one-word blocks (64 KB of data). The 32-bit address splits into a 2-bit byte offset, a 14-bit index that selects one of the 16K entries, and a 16-bit tag compared against the stored tag to generate Hit.]

 Cache with 16K one-word blocks: the byte offset (least significant 2 bits) is ignored and the next 14 bits are used to index into the cache
Cache Write Hit/Miss
 Write-through scheme
 on a write hit: update the data in both the cache and main memory on every write to avoid inconsistency
 on a write miss: write the word into both the cache and main memory – obviously no need to read the missed word from memory!
 Write-through is slow because a memory write is always required
 performance is improved with a write buffer, where words are stored while waiting to be written to memory – the processor can continue execution until the write buffer is full
 when a word in the write buffer completes writing into main memory, that buffer slot is freed and becomes available for future writes
 the DEC 3100 write buffer has 4 words
 Write-back scheme
 write the data block only into the cache, and write the block back to main memory only when it is replaced in the cache
 more efficient than write-through, but more complex to implement
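As a rough sketch of the two policies (hypothetical Python, not actual cache hardware), the difference is simply when main memory gets updated:

# Hypothetical sketch: cache and memory modeled as plain dicts keyed by address.
def write_through(cache, memory, addr, value):
    cache[addr] = value          # update the cache block
    memory[addr] = value         # ...and main memory on every write (a write buffer can hide this latency)

def write_back(cache, dirty, addr, value):
    cache[addr] = value          # update only the cache block
    dirty[addr] = True           # remember it must be written back when the block is replaced

def evict(cache, dirty, memory, addr):
    if dirty.get(addr):          # write-back happens only at replacement time
        memory[addr] = cache[addr]
    cache.pop(addr, None)
    dirty.pop(addr, None)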
Direct Mapped Cache: Taking Advantage of Spatial
Locality
 Taking advantage of spatial locality with larger blocks:

[Figure: direct mapped cache with 4K four-word blocks. The 32-bit address splits into a 2-bit byte offset, a 2-bit block offset that selects one of the four 32-bit words through a multiplexor, a 12-bit index that selects one of the 4K entries, and a 16-bit tag compared against the stored tag to generate Hit.]

 Cache with 4K four-word blocks: the byte offset (least significant 2 bits) is ignored, the next 2 bits are the block offset, and the next 12 bits are used to index into the cache
Direct Mapped Cache: Taking Advantage
of Spatial Locality
 Miss rate falls at first with increasing block size as
expected, but, as block size becomes a large fraction of
total cache size, miss rate may go up because
 there are few blocks
 competition for blocks increases
 blocks get ejected before most of their words are accessed
(thrashing in cache)
[Plot: miss rate versus block size (4 to 256 bytes) for cache sizes from 1 KB to 256 KB – miss rate falls as block size grows, but for the smaller caches it rises again at the largest block sizes.]
Example
 How many total bits are required for a direct-mapped cache with
128 KB of data and 1-word block size, assuming a 32-bit
address?

 Cache data = 128 KB = 2^17 bytes = 2^15 words = 2^15 blocks
 Cache entry size = block data bits + tag bits + valid bit
   = 32 + (32 – 15 – 2) + 1 = 48 bits
 Therefore, total cache size = 2^15 × 48 bits = 2^15 × (1.5 × 32) bits = 1.5 × 2^20 bits = 1.5 Mbits
 data bits in cache = 128 KB × 8 = 1 Mbit
 total cache size / actual cache data = 1.5


Example Problem
 How many total bits are required for a direct-mapped cache with
128 KB of data and 4-word block size, assuming a 32-bit
address?

 Cache data = 128 KB = 2^17 bytes = 2^15 words = 2^13 blocks
 Cache entry size = block data bits + tag bits + valid bit
   = 128 + (32 – 13 – 2 – 2) + 1 = 144 bits
 Therefore, total cache size = 2^13 × 144 bits = 2^13 × (1.125 × 128) bits = 1.125 × 2^20 bits = 1.125 Mbits
 data bits in cache = 128 KB × 8 = 1 Mbit
 total cache size / actual cache data = 1.125
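The arithmetic of both examples can be checked with a short script (a sketch; it assumes the same parameters as above: 32-bit addresses, 4-byte words, and one valid bit per block):

# Hypothetical sketch: total bits for a direct-mapped cache = per-block (data + tag + valid) bits.
import math

def total_cache_bits(data_kib, block_words, address_bits=32):
    data_bits = data_kib * 1024 * 8
    num_blocks = (data_kib * 1024) // (block_words * 4)
    index_bits = int(math.log2(num_blocks))
    offset_bits = 2 + int(math.log2(block_words))      # byte offset + block (word) offset
    tag_bits = address_bits - index_bits - offset_bits
    entry_bits = block_words * 32 + tag_bits + 1        # data + tag + valid bit
    return num_blocks * entry_bits, data_bits

print(total_cache_bits(128, 1))   # (1572864, 1048576) -> 1.5 Mbit total vs 1 Mbit of data
print(total_cache_bits(128, 4))   # (1179648, 1048576) -> 1.125 Mbit total vs 1 Mbit of data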


Example Problem
 Consider a cache with 64 blocks and a block size of 16 bytes. What
block number does byte address 1200 map to?

 As block size = 16 bytes:
   byte address 1200 → block address = ⌊1200 / 16⌋ = 75
 As cache size = 64 blocks:
   block address 75 → cache block = 75 mod 64 = 11
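The same mapping, as a two-line check (assumed Python):

block_address = 1200 // 16         # floor(1200 / 16) = 75
cache_block = block_address % 64   # 75 mod 64 = 11
print(block_address, cache_block)  # 75 11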
Block Size Considerations
 Larger blocks should reduce miss rate
 Due to spatial locality
 But in a fixed-sized cache
 Larger blocks  fewer of them

More competition  increased miss rate
 Larger miss penalty
 Can override benefit of reduced miss rate
 Early restart and critical-word-first can help
Interleaving
 Divides the memory system into a number of
memory modules.
 Each module has its own Address Buffer Register (ABR) and Data Buffer
Register (DBR).
 Arranges addressing so that successive words in
the address space are placed in different modules.
 When requests for memory access involve
consecutive addresses, the access will be to
different modules.
 Since parallel access to these modules is possible,
the average rate of fetching words from the Main
Memory can be increased.
Methods of address
layouts
[Figure: two ways of splitting a memory address across n = 2^k modules, each with its own ABR and DBR. Left: the module number comes from the high-order k bits and the address within the module from the low-order m bits. Right: the module number comes from the low-order k bits and the address within the module from the high-order m bits.]

 High-order interleaving (module number = high-order k bits):
 consecutive words are placed in the same module
 high-order k bits of a memory address determine the module; low-order m bits determine the word within the module
 when a block of words is transferred from main memory to cache, only one module is busy at a time

 Low-order interleaving (module number = low-order k bits):
 consecutive words are located in consecutive modules
 consecutive addresses can be located in consecutive modules
 while transferring a block of data, several memory modules can be kept busy at the same time
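A sketch of the two layouts (hypothetical Python; 2^k modules of 2^m words each):

# Hypothetical sketch: split a word address into (module, address-in-module) under each layout.
def high_order_interleave(addr, k, m):
    return addr >> m, addr & ((1 << m) - 1)   # high-order k bits pick the module

def low_order_interleave(addr, k, m):
    return addr & ((1 << k) - 1), addr >> k   # low-order k bits pick the module

# With 4 modules (k = 2) of 16 words (m = 4), consecutive addresses 0..3:
for a in range(4):
    print(high_order_interleave(a, 2, 4), low_order_interleave(a, 2, 4))
# High-order: all four fall in module 0; low-order: they spread across modules 0-3.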
Improving Cache Performance by Increasing
Bandwidth
 Assume:
 cache block of 4 words
 1 clock cycle to send address (1 bus trip)
 15 clock cycles for each memory data access
 1 clock cycle to send data (1 bus trip)
[Figure: three memory organizations – (a) one-word-wide memory and bus; (b) wide memory organization with a 4-word-wide memory and bus plus a multiplexor at the cache; (c) interleaved memory organization with four one-word-wide banks sharing one bus.]

Miss penalty for a 4-word block:
a. One-word-wide memory organization: 1 + 4×15 + 4×1 = 65 cycles
b. Wide memory organization (4-word-wide memory and bus): 1 + 1×15 + 1×1 = 17 cycles
c. Interleaved memory organization (banks accessed in parallel, words sent one at a time over the bus): 1 + 1×15 + 4×1 = 20 cycles
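Plugging the assumed timings into a short calculation (a sketch; the cycle counts are the ones listed above):

# Hypothetical sketch of the three miss-penalty calculations.
ADDR, MEM, BUS = 1, 15, 1      # cycles: send address, access one bank, send one word on the bus
words = 4                      # 4-word block

one_word_wide = ADDR + words * MEM + words * BUS   # 1 + 4*15 + 4*1 = 65 cycles
wide_4_words  = ADDR + 1 * MEM + 1 * BUS           # 1 + 15 + 1      = 17 cycles
interleaved   = ADDR + 1 * MEM + words * BUS       # 1 + 15 + 4*1    = 20 cycles (bank accesses overlap)
print(one_word_wide, wide_4_words, interleaved)    # 65 17 20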
Decreasing Miss Rates with Associative Block
Placement

 Direct mapped: one unique cache location for each


memory block
 cache block address = memory block address mod cache size
 Fully associative: each memory block can be placed anywhere in the cache
 all cache entries are searched (in parallel) to locate a block
 Set associative: each memory block maps to a unique set of cache locations – if the set is of size n, the cache is n-way set-associative
 cache set address = memory block address mod number of
sets in cache
 all cache entries in the corresponding set are searched (in
parallel) to locate block
 Increasing degree of associativity
 reduces miss rate
 increases hit time because of the parallel search and then
fetch
Decreasing Miss Rates with Associative Block
Placement
[Figure: location of a memory block with address 12 in a cache with 8 blocks under different degrees of associativity – direct mapped: block 4 (12 mod 8); 2-way set associative: set 0 (12 mod 4), both blocks of the set searched; fully associative: any block, all entries searched.]
Decreasing Miss Rates with Associative Block
Placement
[Figure: configurations of an 8-block cache with different degrees of associativity – one-way set associative (direct mapped): 8 sets of 1 block; two-way set associative: 4 sets of 2 blocks; four-way set associative: 2 sets of 4 blocks; eight-way set associative (fully associative): 1 set of 8 blocks. Each block holds a tag and data.]
Example
 Find the number of misses for a cache with four 1-word blocks, given the following sequence of memory block accesses:
0, 8, 0, 6, 8
for each of the following cache configurations:
1. direct mapped
2. 2-way set associative (use LRU replacement policy)
3. fully associative

 Note about LRU replacement


 in a 2-way set associative cache LRU replacement
can be implemented with one bit at each set whose
value indicates the most recently referenced block
Solution
 1 (direct-mapped)
Block address Cache block
0 0 (= 0 mod 4)
6 2 (= 6 mod 4)
8 0 (= 8 mod 4)
Block address translation in direct-mapped cache

Address of memory block accessed / Hit or miss / Contents of cache blocks after reference (blocks 0–3):
0 miss Memory[0]
8 miss Memory[8]
0 miss Memory[0]
6 miss Memory[0] Memory[6]
8 miss Memory[8] Memory[6]
Cache contents after each reference – red indicates new entry added
 5 misses
Solution (cont.)

 2 (two-way set-associative)
Block address Cache set
0 0 (= 0 mod 2)
6 0 (= 6 mod 2)
8 0 (= 8 mod 2)
Block address translation in a two-way set-associative cache

Address of memory block accessed / Hit or miss / Contents of cache blocks after reference (Set 0, Set 0, Set 1, Set 1):
0 miss Memory[0]
8 miss Memory[0] Memory[8]
0 hit Memory[0] Memory[8]
6 miss Memory[0] Memory[6]
8 miss Memory[8] Memory[6]
Cache contents after each reference – red indicates new entry added
 4 misses
Solution (cont.)
 3 (fully associative)

Address of memory block accessed / Hit or miss / Contents of cache blocks after reference (Blocks 0–3):
0 miss Memory[0]
8 miss Memory[0] Memory[8]
0 hit Memory[0] Memory[8]
6 miss Memory[0] Memory[8] Memory[6]
8 hit Memory[0] Memory[8] Memory[6]
Cache contents after each reference – red indicates new entry added

 3 misses
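The three cases can be replayed with a tiny simulator (a hypothetical Python sketch; LRU is modeled by keeping each set as a recency-ordered list, which reproduces the hand-worked tables above):

# Hypothetical sketch: n-way set-associative cache with LRU replacement
# (1-way = direct mapped; ways = num_blocks = fully associative).
def count_misses(refs, num_blocks, ways):
    num_sets = num_blocks // ways
    sets = [[] for _ in range(num_sets)]      # each set is a list ordered from LRU to MRU
    misses = 0
    for block in refs:
        s = sets[block % num_sets]            # set address = block address mod number of sets
        if block in s:
            s.remove(block)                   # hit: refresh the block's recency
        else:
            misses += 1
            if len(s) == ways:
                s.pop(0)                      # set full: evict the least recently used block
        s.append(block)                       # referenced block becomes most recently used
    return misses

refs = [0, 8, 0, 6, 8]
print(count_misses(refs, 4, 1))   # direct mapped:         5 misses
print(count_misses(refs, 4, 2))   # 2-way set associative: 4 misses
print(count_misses(refs, 4, 4))   # fully associative:     3 misses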
Implementation of a Set-Associative
Cache

[Figure: implementation of a 4-way set-associative cache. The 32-bit address splits into a 2-bit byte offset, an 8-bit set index (bits 9–2) that selects one of 256 sets, and a 22-bit tag (bits 31–10) compared in parallel against the four tags of the indexed set; the comparison results drive a 4-to-1 multiplexor that selects the matching data word and generate Hit.]

 4-way set-associative cache with 4 comparators and one 4-to-1 multiplexor: cache size is 1K blocks = 256 sets × 4 blocks per set
Performance with Set-Associative Caches
[Plot: miss rate (0–15%) versus associativity (one-way, two-way, four-way, eight-way) for cache sizes from 1 KB to 128 KB – miss rate falls as associativity increases, with the largest improvement for the smaller caches.]

 Miss rates for each of eight cache sizes (1 KB to 128 KB) with increasing associativity: data generated from SPEC92 benchmarks with a 32-byte block size for all caches
Replacement Policy
 Direct mapped: no choice
 Set associative

Prefer non-valid entry, if there is one

Otherwise, choose among entries in the set
 Least-recently used (LRU)

Choose the one unused for the longest time

Simple for 2-way, manageable for 4-way, too hard
beyond that
 Random

Gives approximately the same performance
as LRU for high associativity
Multilevel Caches
 Primary cache attached to CPU
 Small, but fast
 Level-2 cache services misses from
primary cache
 Larger, slower, but still faster than main
memory
 Main memory services L-2 cache misses
 Some high-end systems include L-3
cache
Decreasing Miss Penalty with Multilevel
Caches
 Add a second-level cache
 primary cache is on the same chip as the processor

 use SRAMs to add a second-level cache, between

main memory and the first-level cache


 if a miss occurs in the primary cache, the second-level cache is accessed
 if the data is found in the second-level cache, the miss penalty of L1 is the access time of the second-level cache, which is much less than the main memory access time
 if the access misses again at the second level, then a main memory access is required and a large miss penalty is incurred
Average memory access
time
 Considering the L1 cache and main memory:
   t_ave = hC + (1 – h)M
 where t_ave is the average memory access time
 h is the hit rate in the L1 cache
 C is the time to access information in the L1 cache
 M is the time to access information in main memory
Calculation of average memory access time with a second level of cache
 The average access time in a system with 2 levels of caches is
   t_ave = h1·C1 + (1 – h1)·h2·C2 + (1 – h1)(1 – h2)·M
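A quick check of both formulas (a sketch; the hit rates and access times below are illustrative, not from the slides):

# Hypothetical sketch: average memory access time with one and two levels of cache.
def amat_one_level(h, C, M):
    return h * C + (1 - h) * M

def amat_two_levels(h1, C1, h2, C2, M):
    return h1 * C1 + (1 - h1) * h2 * C2 + (1 - h1) * (1 - h2) * M

# Illustrative numbers: 1-cycle L1, 10-cycle L2, 100-cycle main memory.
print(amat_one_level(0.95, 1, 100))             # 5.95 cycles
print(amat_two_levels(0.95, 1, 0.80, 10, 100))  # 2.35 cycles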
Memory Management
Program Execution
Sharing RAM
Single contiguous model
Partition Model
Allocating/Deallocating
Processes
Partition Model (Fragmentation)
Partition Model (Finding the right fit)
Partition Model (Deallocation)
Limitations
[Figure 5.26. Virtual memory organization: the processor issues virtual addresses, which the MMU translates into physical addresses; the physical address is used to access the cache and main memory, and data is moved between main memory and disk storage by DMA transfer.]


Virtual Memory
Virtual Memory (MMU)
Process addresses
MMU
MMU mapping
MMU mapping: 32-bit systems
Working of virtual memory
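The MMU-mapping slides themselves are figures, but the page-table translation they describe can be sketched as follows (hypothetical Python; assumes a 32-bit virtual address with 4 KB pages, i.e. a 20-bit virtual page number and a 12-bit offset):

# Hypothetical sketch: virtual-to-physical translation through a page table.
PAGE_SIZE = 4096                      # 4 KB pages -> 12-bit offset on a 32-bit system

def translate(virtual_address, page_table):
    vpn = virtual_address // PAGE_SIZE        # virtual page number (upper 20 bits)
    offset = virtual_address % PAGE_SIZE      # offset within the page (lower 12 bits)
    if vpn not in page_table:
        raise LookupError("page fault: page must be brought in from disk")
    frame = page_table[vpn]                   # physical frame number from the page table
    return frame * PAGE_SIZE + offset

page_table = {0x00400: 0x12345}               # hypothetical mapping: one resident page
print(hex(translate(0x00400ABC, page_table))) # 0x12345abc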
