
Chapter 6: Memory

Ngo Lam Trung, Pham Ngoc Hung

[with materials from Computer Organization and Design, MK, and M.J. Irwin's presentation, PSU 2008]


Content
❑ Memory hierarchy
❑ Principle of locality
❑ Cache
❑ Virtual memory


Memory
❑ Memory: where data are stored.

Why is memory critical to performance?


Memory technology (2012)
❑ Static RAM (SRAM): 0.5ns – 2.5ns, $500 – $1000 per GB
❑ Dynamic RAM (DRAM): 50ns – 70ns, $10 – $20 per GB
❑ Flash memory: 5,000ns – 50,000ns, $0.75 – $1 per GB
❑ Magnetic disk: 5,000,000ns – 20,000,000ns, $0.05 – $0.1 per GB
❑ Fact:
Large memories are slow.
Fast memories are small (and expensive).


A Typical Memory Hierarchy

[Figure: on-chip components (register file, instruction and data TLBs, separate L1 instruction and data caches, datapath and control) backed by a second-level cache (SRAM), main memory (DRAM), and secondary memory (disk).]

Speed (cycles): ½'s → 1's → 10's → 100's → 10,000's
Size (bytes): 100's → 10K's → M's → G's → T's
Cost per byte: highest → lowest

❑ How to get an ideal memory:
as fast as SRAM,
as cheap as disk?
The Memory Hierarchy: Locality Principle
❑ C program

int x[1000], temp;
int i, j;
for (i = 0; i < 999; i++)
    for (j = i + 1; j < 1000; j++)
        if (x[i] < x[j]) {
            temp = x[i];
            x[i] = x[j];
            x[j] = temp;
        }

Data memory at the locations of temp and x is accessed multiple times.
Instruction memory at the location of the two for loops is used repeatedly.


The Memory Hierarchy: Locality Principle

❑ Temporal Locality (locality in time)
If a memory location is referenced, then it will tend to be referenced again soon.
 Keep most recently accessed data items closer to the processor.

❑ Spatial Locality (locality in space)
If a memory location is referenced, the locations with nearby addresses will tend to be referenced soon.
 Move blocks consisting of contiguous words closer to the processor.


Hierarchical memory access

❑ Data are stored in multiple levels.
High level: fast but small.
Low level: slow but large.
❑ Data are transferred between levels, through the hierarchy, in units of blocks (of multiple words).
❑ Frequently used data are stored closer to the processor.


Hierarchical memory access

❑ Associative data access:
The processor accesses data in a lower level.
Data transfer from the lower level to the processor goes via the upper level(s).
❑ If the accessed data is present in the upper level
Hit: access satisfied by the upper level.
- Hit ratio: hits/accesses
❑ If the accessed data is absent
Miss: block copied from the lower level.
- Time taken: miss penalty
- Miss ratio: misses/accesses = 1 – hit ratio
Then the accessed data is supplied from the upper level.
The Memory Hierarchy: Terminology
❑ Hit: data is in some block in the upper level (Blk X)
Hit Rate: fraction of memory accesses found in the upper level.
Hit Time: time to access the upper level, which consists of
- RAM access time + time to determine hit/miss

[Figure: the processor exchanges data with the upper-level memory (holding Blk X); the upper level exchanges blocks with the lower-level memory (holding Blk Y).]

❑ Miss: data is not in the upper level, so it needs to be retrieved from a block in the lower level (Blk Y)
Miss Rate = 1 – Hit Rate
Miss Penalty: time to bring in a block from the lower level and replace a block in the upper level with it + time to deliver the block to the processor.
Hit Time << Miss Penalty
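These quantities combine into the standard average memory access time formula, AMAT = Hit Time + Miss Rate × Miss Penalty. AMAT is not spelled out on this slide, so the C sketch below is only a standard-formula illustration; the sample numbers are made up.

#include <stdio.h>

/* AMAT = hit_time + miss_rate * miss_penalty (standard formula;
   the numbers below are illustrative, not from the slides). */
static double amat(double hit_time, double miss_rate, double miss_penalty) {
    return hit_time + miss_rate * miss_penalty;
}

int main(void) {
    /* e.g., 1-cycle hit, 5% miss rate, 100-cycle miss penalty */
    printf("AMAT = %.2f cycles\n", amat(1.0, 0.05, 100.0));
    return 0;
}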
Cache
❑ The memory hierarchy level between the processor and main memory
The CPU fetches instructions and data from the cache; if found (cache hit) → fast access.
If not found (cache miss) → load a block from main memory into the cache, then access it in the cache → slower access time (miss penalty).

[Figure: the CPU issues instruction fetches and memory reads/writes to the cache, which exchanges blocks of data with main memory.]


Cache Basics
❑ The CPU needs to access a data item in memory
➔ Two questions to answer (in hardware):
Q1: How does the CPU know if the data item is in the cache?
Q2: If it is, how does the CPU find it?

❑ Direct mapped
Each memory block is mapped to exactly one block in the cache
- lots of lower-level blocks must share blocks in the cache.
Address mapping (to answer Q2):
(block address) modulo (# of blocks in the cache)
The tag field: associated with each cache block; it contains the address information (the upper portion of the address) required to identify the block (to answer Q1).
The valid bit: indicates whether there is data in the block or not.
(These fields are illustrated in the sketch below.)
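To make the direct-mapped fields concrete, here is a minimal C sketch of decomposing a 32-bit byte address into tag, index, and byte offset. The parameters match the 4-block example that follows; all names are my own.

#include <stdint.h>
#include <stdio.h>

/* Illustrative parameters: 4 one-word (4-byte) blocks. */
#define NUM_BLOCKS  4           /* cache blocks              */
#define INDEX_BITS  2           /* log2(NUM_BLOCKS)          */
#define OFFSET_BITS 2           /* byte within a 32-bit word */

int main(void) {
    uint32_t addr = 0x000000B4;                    /* any byte address */
    uint32_t block_addr = addr >> OFFSET_BITS;
    uint32_t index = block_addr % NUM_BLOCKS;      /* (block address) mod (# blocks) */
    uint32_t tag   = block_addr >> INDEX_BITS;     /* upper portion of the address   */
    printf("addr=0x%08x -> tag=0x%x index=%u\n", addr, tag, index);
    return 0;
}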


Caching: A Simple First Example

[Figure: a 4-block cache (Index 00–11, each entry with Valid, Tag, Data fields) mapped onto a 16-word main memory (addresses 0000xx–1111xx). One-word blocks; the two low-order bits define the byte in the word (32-bit words).]

Q1: Is it there?
Compare the cache tag to the high-order 2 memory address bits to tell if the memory block is in the cache.

Q2: How does the CPU find it?
Use the next 2 low-order memory address bits – the index – to determine which cache block, i.e.,
(block address) modulo (# of blocks in the cache)


Direct Mapped Cache
❑ Consider the main memory word reference string: 0 1 2 3 4 3 4 15
Start with an empty cache – all blocks initially marked as not valid.

0 miss, 1 miss, 2 miss, 3 miss: Mem(0)…Mem(3) are loaded into indices 00–11, each with tag 00.
4 miss: 4 maps to index 00, so Mem(4) (tag 01) replaces Mem(0).
3 hit, 4 hit: both are now resident.
15 miss: 15 maps to index 11, so Mem(15) (tag 11) replaces Mem(3).

8 requests, 6 misses.
What if we repeat the reference string 1,000,000 times?


Cache performance
❑ Given a MIPS CPU running a program with an instruction-cache miss rate of 2% and a data-cache miss rate of 4%. The processor has a CPI of 2 without any memory stalls, and the miss penalty is 100 cycles for all misses.
❑ Determine how much faster the processor would run with a perfect cache that never missed. Assume the frequency of all loads and stores is 36%.
❑ Solution:
❑ Given instruction count I:
Instruction miss cycles = I × 2% × 100 = 2.00 I
Data miss cycles = I × 36% × 4% × 100 = 1.44 I
❑ Total memory-stall cycles = 2.00 I + 1.44 I = 3.44 I
❑ CPI with stalls = 2 + 3.44 = 5.44, so the processor with a perfect cache is 5.44 / 2 = 2.72 times faster.
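The same calculation as a small C sketch (variable names are my own):

#include <stdio.h>

/* Per-instruction memory-stall cycles and speedup vs. a perfect cache,
   mirroring the worked example above. */
int main(void) {
    double base_cpi     = 2.0;
    double miss_penalty = 100.0;
    double i_miss_rate  = 0.02;   /* instruction cache */
    double d_miss_rate  = 0.04;   /* data cache        */
    double mem_fraction = 0.36;   /* loads + stores    */

    double stall_cpi = i_miss_rate * miss_penalty
                     + mem_fraction * d_miss_rate * miss_penalty;
    double speedup = (base_cpi + stall_cpi) / base_cpi;
    printf("stall CPI = %.2f, speedup with perfect cache = %.2f\n",
           stall_cpi, speedup);
    return 0;
}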


Cache performance
❑ For the same CPU and program as above, what is the speedup of a perfect cache if the CPU now has a faster CPI of 1 (instead of 2)?


MIPS Direct Mapped Cache Example
❑ One-word blocks, cache size = 1K words (or 4KB)

[Figure: a 32-bit address split into a 20-bit tag (bits 31–12), a 10-bit index (bits 11–2), and a 2-bit byte offset (bits 1–0). The index selects one of 1024 cache entries (Valid, Tag, Data); a comparator checks the stored 20-bit tag against the address tag to produce Hit, and the 32-bit data word is returned.]

What kind of locality are we taking advantage of?
MIPS Direct Mapped Cache Example
❑ Same one-word-block, 1K-word cache as above.

Calculate the total size of this cache in Kilobits.
Exercise
❑ How many total bits are required for a direct-mapped cache with 16 KiB of data and 1-word blocks, assuming a 32-bit address?


Multiword Block Direct Mapped Cache
❑ Four words/block, cache size = 1K words

[Figure: a 32-bit address split into a 20-bit tag (bits 31–12), an 8-bit index (bits 11–4), a 2-bit block offset (bits 3–2), and a 2-bit byte offset (bits 1–0). The index selects one of 256 cache entries (Valid, Tag, four Data words); the tag comparison produces Hit, and the block offset selects the requested word from the block.]

What kind of locality are we taking advantage of?


Taking Advantage of Spatial Locality
❑ Let a cache block hold more than one word (here: two blocks of two words each).
Consider the reference string 0 1 2 3 4 3 4 15, starting with an empty cache – all blocks initially marked as not valid.

0 miss: block Mem(1)–Mem(0) loaded.
1 hit.
2 miss: block Mem(3)–Mem(2) loaded.
3 hit.
4 miss: Mem(5)–Mem(4) (tag 01) replaces Mem(1)–Mem(0).
3 hit, 4 hit.
15 miss: Mem(15)–Mem(14) (tag 11) replaces Mem(3)–Mem(2).

8 requests, 4 misses.


Miss Rate vs Block Size vs Cache Size

[Figure: miss rate (%) versus block size (bytes), one curve per cache size (8 KB, 16 KB, 64 KB, 256 KB).]

❑ Miss rate goes up if the block size becomes a significant fraction of the cache size, because the number of blocks that can be held in the same size cache is smaller (increasing capacity misses).
Cache Field Sizes
❑ The number of bits in a cache includes both the storage for data and for the tags
32-bit byte address.
For a direct-mapped cache with 2^n blocks, n bits are used for the index.
For a block size of 2^m words (2^(m+2) bytes), m bits are used to address the word within the block and 2 bits are used to address the byte within the word.
❑ What is the size of the tag field? 32 − (n + m + 2) bits.
❑ The total number of bits in a direct-mapped cache is then
2^n × (block size + tag field size + valid field size)
❑ How many total bits are required for a direct-mapped cache with 16KB of data and 4-word blocks, assuming a 32-bit address? (See the sketch below.)
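A small C sketch of the bit-count formula above; it also evaluates the 16 KB / 4-word-block case (function name and layout are my own):

#include <stdio.h>

/* Total bits in a direct-mapped cache with 32-bit addresses:
   2^n blocks x (data bits + tag bits + 1 valid bit). */
static unsigned long cache_bits(unsigned n_index, unsigned m_word) {
    unsigned long blocks = 1UL << n_index;
    unsigned long data   = (1UL << m_word) * 32;        /* bits per block */
    unsigned long tag    = 32 - n_index - m_word - 2;   /* remaining bits */
    return blocks * (data + tag + 1);
}

int main(void) {
    /* 16KB data = 4K words = 1K four-word blocks: n = 10, m = 2 */
    printf("total = %lu bits\n", cache_bits(10, 2));   /* 1024*(128+18+1) */
    return 0;
}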
Exercise
❑ How many total bits are required for a direct-mapped cache with 16 KiB of data and 4-word blocks, assuming a 32-bit address?


Sources of Cache Misses
❑ Compulsory (cold start, first reference):
First access to a block.
We cannot do much about this.
Solution: increase block size (but this also increases miss penalty).

❑ Capacity:
The cache cannot contain all blocks accessed by the program.
Solution: increase cache size (may increase access time).

❑ Conflict (collision):
Multiple memory locations mapped to the same cache location.
Solution 1: increase cache size.
Solution 2: increase associativity (may increase access time).


Reducing Cache Miss Rates #1
➔ Allow more flexible block placement
❑ Direct mapped cache: a memory block maps to exactly one cache block.
❑ Fully associative cache: allows a memory block to be mapped to any cache block.
❑ A compromise is to divide the cache into sets, each of which consists of n "ways" (n-way set associative). A memory block maps to a unique set (specified by the index field) and can be placed in any way of that set (so there are n choices):
(block address) modulo (# sets in the cache)
A lookup sketch follows below.
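A minimal C sketch of an n-way set-associative lookup, under illustrative assumptions (2 sets × 2 ways, one-word blocks; the structure and names are my own):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define NUM_SETS 2
#define NUM_WAYS 2

struct line { bool valid; uint32_t tag; uint32_t data; };
static struct line cache[NUM_SETS][NUM_WAYS];

/* Index selects the set; every way's tag in that set is compared
   (in parallel, in real hardware). */
static bool lookup(uint32_t addr, uint32_t *out) {
    uint32_t block_addr = addr >> 2;            /* one-word (4-byte) blocks  */
    uint32_t set = block_addr % NUM_SETS;       /* (block address) mod #sets */
    uint32_t tag = block_addr / NUM_SETS;
    for (int way = 0; way < NUM_WAYS; way++) {
        if (cache[set][way].valid && cache[set][way].tag == tag) {
            *out = cache[set][way].data;
            return true;
        }
    }
    return false;                               /* miss */
}

int main(void) {
    cache[0][1] = (struct line){true, 2, 0xBEEF}; /* word address 4 -> set 0, tag 2 */
    uint32_t w;
    printf("hit=%d\n", lookup(16, &w));           /* byte address 16 */
    return 0;
}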




Another Reference String Mapping
❑ Consider the main memory word reference string 0 4 0 4 0 4 0 4 on the direct-mapped cache above, starting with an empty cache – all blocks initially marked as not valid.

0 miss: Mem(0) (tag 00) loaded into index 00.
4 miss: Mem(4) (tag 01) replaces Mem(0).
0 miss: Mem(0) replaces Mem(4) … and so on for every reference.

8 requests, 8 misses.
❑ Ping-pong effect due to conflict misses: two memory locations that map into the same cache block keep evicting each other.
Set Associative Cache Example

[Figure: a two-way set-associative cache with 2 sets (ways 0–1 × sets 0–1, each entry with V, Tag, Data) mapped onto a 16-word main memory (addresses 0000xx–1111xx). One-word blocks; the two low-order bits define the byte in the word (32-bit words).]

Q1: Is it there?
Compare all the cache tags in the set to the high-order 3 memory address bits to tell if the memory block is in the cache.

Q2: How do we find it?
Use the next 1 low-order memory address bit to determine which cache set (i.e., modulo the number of sets in the cache).
Another Reference String Mapping
❑ Consider the main memory word reference string 0 4 0 4 0 4 0 4 on the two-way set-associative cache, starting with an empty cache – all blocks initially marked as not valid.

0 miss: Mem(0) (tag 000) loaded into set 0.
4 miss: Mem(4) (tag 010) loaded into the other way of set 0.
0 hit, 4 hit, 0 hit, 4 hit, 0 hit, 4 hit.

8 requests, 2 misses.
❑ Solves the ping-pong effect in a direct-mapped cache due to conflict misses, since two memory locations that map into the same cache set can now co-exist!


Four-Way Set Associative Cache
❑ 2^8 = 256 sets, each with four ways (each with one block)

[Figure: the 32-bit address is split into a 22-bit tag, an 8-bit index, and a 2-bit byte offset. The index selects one set; all four ways (V, Tag, Data, for entries 0–255) are compared in parallel, any way match asserts Hit, and a 4-to-1 multiplexer selects the 32-bit data word from the matching way.]
Range of Set Associative Caches
❑ For a fixed-size cache, increasing the number of blocks per set decreases the number of sets.

Address layout: Tag (used for tag compare) | Index (selects the set) | Block offset (selects the word in the block) | Byte offset

Decreasing associativity → direct mapped (only one way): smaller tags, only a single comparator.
Increasing associativity → fully associative (only one set): the tag is all the bits except the block and byte offset.


Benefits of Set Associative Caches
❑ The choice of direct mapped or set associative depends on the cost of a miss versus the cost of implementation.

[Figure: miss rate versus associativity (1-way, 2-way, 4-way, 8-way) for cache sizes from 4KB to 512KB; miss rate falls as associativity and cache size grow. Data from Hennessy & Patterson, Computer Architecture, 2003.]

❑ Largest gains are in going from direct mapped to 2-way (20%+ reduction in miss rate).
Block replacement
❑ Cache miss: a new block is loaded into the cache and will replace an old block.
➔ Which block should be replaced?
❑ Direct-mapped cache: exactly one choice.
❑ Associative cache: one of multiple blocks in the set must be selected.
➔ LRU scheme: the least recently used block – the one that has been unused for the longest time – is selected for replacement. A mechanism for tracking relative last use is necessary (see the sketch below).
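One simple mechanism is a per-way counter updated on every access; a hedged C sketch (a common textbook scheme, not necessarily what real hardware implements; all names are my own):

#include <stdint.h>
#include <stdio.h>

#define NUM_WAYS 4

struct line { int valid; uint32_t tag; uint32_t last_used; };
static struct line set[NUM_WAYS];
static uint32_t now;                      /* global access counter */

/* Record a use of `way` (call on every hit or fill). */
static void touch(int way) { set[way].valid = 1; set[way].last_used = ++now; }

/* Victim choice: any invalid way first, else the least recently used. */
static int lru_victim(void) {
    int victim = 0;
    for (int w = 0; w < NUM_WAYS; w++) {
        if (!set[w].valid) return w;                    /* free slot    */
        if (set[w].last_used < set[victim].last_used)
            victim = w;                                 /* oldest entry */
    }
    return victim;
}

int main(void) {
    for (int w = 0; w < NUM_WAYS; w++) touch(w);  /* fill all ways       */
    touch(0);                                     /* way 0 recently used */
    printf("victim = %d\n", lru_victim());        /* -> 1, the LRU way   */
    return 0;
}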


LRU block replacement
❑ Consider the main memory word reference string 0 4 2 4 0 0 0 4 on the two-way set-associative cache, starting with an empty cache – all blocks initially marked as not valid.

0 miss: Mem(0) loaded.
4 miss: Mem(4) loaded into the other way.
2 miss: Mem(2) replaces Mem(0), the least recently used block.
4 hit.
0 miss: Mem(0) replaces Mem(2), the least recently used block.
0 hit, 0 hit, 4 hit.

8 requests, 4 misses.


Reducing Cache Miss Rates #2
➔ Use multiple levels of caches
Very costly in the 1990s: US$100,000 or above.
Common in the 2020s: ~US$500 machines.
❑ Normally a unified L2 cache (holding both instructions and data, for each core) and a unified L3 cache shared by all cores.


Multilevel Cache Design Considerations
❑ Design considerations for L1 and L2 caches are very different
The primary cache should focus on minimizing hit time in support of a shorter clock cycle:
- smaller, with smaller block sizes.
Secondary cache(s) should focus on reducing miss rate to reduce the penalty of long main memory access times:
- larger, with larger block sizes,
- higher levels of associativity.

❑ The miss penalty of the L1 cache is significantly reduced by the presence of an L2 cache – so L1 can be smaller but have a higher miss rate.
❑ For the L2 cache, hit time is less important than miss rate:
The L2$ hit time determines L1$'s miss penalty.
L2$ local miss rate >> the global miss rate.
Example
❑ Given a processor with a base CPI of 1.0 and a clock rate of 4 GHz. Main memory access time is 100 ns.
All data references hit in the primary cache (L1).
Instruction miss rate of 2% in the primary cache (L1).
❑ A new L2 is added:
Access time from L1 to L2 is 5 ns.
Instruction miss rate (to main memory) reduced to 0.5%.
❑ What is the speed-up after adding the L2?


Answer

[Figure: CPU (CPI = 1, f = 4 GHz) → L1 (2% of instructions miss) → L2 (5 ns access; 0.5% of instructions miss to memory) → main memory (100 ns).]

❑ CPI = BaseCPI + StallCPI = BaseCPI + IStall + DStall
❑ BaseCPI = 1, DStall = 0
❑ 5 ns = 20 cycles, 100 ns = 400 cycles
❑ Without L2: IStall = 2% × 400 = 8
❑ With L2: IStall = IStall1 + IStall2 = 2% × 20 + 0.5% × 400 = 2.4
❑ Speedup = (1 + 8) / (1 + 2.4) ≈ 2.6
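The same two-level calculation as a C sketch (variable names are my own):

#include <stdio.h>

int main(void) {
    double base_cpi   = 1.0;
    double l2_cycles  = 20.0;   /*   5 ns at 4 GHz */
    double mem_cycles = 400.0;  /* 100 ns at 4 GHz */

    double stall_no_l2   = 0.02 * mem_cycles;      /* all misses go to memory */
    double stall_with_l2 = 0.02 * l2_cycles        /* L1 misses go to L2      */
                         + 0.005 * mem_cycles;     /* L2 misses go to memory  */
    printf("speedup = %.2f\n",
           (base_cpi + stall_no_l2) / (base_cpi + stall_with_l2));
    return 0;
}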


Handling Cache Hits
❑ Read hits (I$ and D$)
This is what we want!

❑ Write hits (D$ only)
Require the cache and memory to be consistent:
- always write the data into both the cache block and the next level in the memory hierarchy (write-through);
- writes run at the speed of the next level in the memory hierarchy – so slow! – or can use a write buffer and stall only if the write buffer is full.
Allow cache and memory to be inconsistent:
- write the data only into the cache block (write-back the cache block to the next level in the memory hierarchy when that cache block is "evicted");
- need a dirty bit for each data cache block to tell if it needs to be written back to memory when it is evicted – can use a write buffer to help "buffer" write-backs of dirty blocks.
The two policies are contrasted in the sketch below.
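A toy C sketch contrasting the two write-hit policies (the stub for the next level and all names are my own; real caches do this in hardware):

#include <stdint.h>
#include <stdio.h>

struct line { int valid, dirty; uint32_t tag; uint32_t data; };

/* Stub standing in for the next level of the hierarchy. */
static void next_level_write(uint32_t addr, uint32_t word) {
    printf("next level: mem[0x%08x] <- 0x%08x\n", addr, word);
}

/* Write-through: update the cache block AND the next level
   (slow, unless the write is absorbed by a write buffer). */
static void write_hit_through(struct line *l, uint32_t addr, uint32_t word) {
    l->data = word;
    next_level_write(addr, word);
}

/* Write-back: update only the cache block and mark it dirty;
   the block goes to the next level only when it is evicted. */
static void write_hit_back(struct line *l, uint32_t word) {
    l->data = word;
    l->dirty = 1;
}

static void evict(struct line *l, uint32_t addr) {
    if (l->dirty) next_level_write(addr, l->data);  /* write-back on eviction */
    l->valid = l->dirty = 0;
}

int main(void) {
    struct line a = {1, 0, 0, 0}, b = {1, 0, 0, 0};
    write_hit_through(&a, 0x100, 0xAB);   /* memory updated immediately   */
    write_hit_back(&b, 0xCD);             /* memory updated only on evict */
    evict(&b, 0x200);
    return 0;
}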


Handling Cache Misses (Single Word Blocks)
❑ Read misses (I$ and D$)
Stall the pipeline, fetch the block from the next level in the memory hierarchy, install it in the cache and send the requested word to the processor, then let the pipeline resume.
❑ Write misses (D$ only)
1. Stall the pipeline, fetch the block from the next level in the memory hierarchy, install it in the cache (which may involve having to evict a dirty block if using a write-back cache), write the word from the processor to the cache, then let the pipeline resume; or
2. Write allocate – just write the word into the cache, updating both the tag and data; no need to check for a cache hit, no need to stall; or
3. No-write allocate – skip the cache write (but must invalidate that cache block since it will now hold stale data) and just write the word to the write buffer (and eventually to the next memory level); no need to stall if the write buffer isn't full.
Multiword Block Considerations
❑ Read misses (I$ and D$)
Processed the same as for single-word blocks – a miss returns the entire block from memory.
Miss penalty grows as block size grows:
- Early restart – the processor resumes execution as soon as the requested word of the block is returned.
- Requested word first – the requested word is transferred from the memory to the cache (and processor) first.
Nonblocking cache – allows the processor to continue to access the cache while the cache is handling an earlier miss.
❑ Write misses (D$)
If using write allocate, must first fetch the block from memory and then write the word to the block (or could end up with a "garbled" block in the cache, e.g., for 4-word blocks: a new tag, one word of data from the new block, and three words of data from the old block).
Exercise
❑ Given a CPU with a 32-bit address and the word reference string below:
3, 180, 43, 2, 191, 88, 190, 14, 181, 44, 186, 253
❑ Identify the binary address, tag field, block index field, and hit ratio in the following cases:
The CPU has a direct-mapped cache of 16 one-word blocks.
The CPU has a direct-mapped cache of 8 two-word blocks.


Exercise
❑ Given a CPU with a 32-bit address and the word reference string below:
3, 180, 43, 2, 191, 88, 190, 14, 181, 44, 186, 253
❑ The CPU has a direct-mapped cache with a total of 8 data words. The miss penalty is 25 cycles.
❑ Which of the following designs is optimal for the above reference string?
8× one-word blocks, access time of 2 cycles.
4× two-word blocks, access time of 3 cycles.
2× four-word blocks, access time of 5 cycles.


Virtual Memory
❑ Main memory (RAM) can be used as a "cache" for secondary storage (disk), but not mainly for performance.

[Figure: CPU ↔ cache (transfers words) ↔ main memory (transfers blocks) ↔ secondary memory (transfers pages); main memory plus secondary memory appear as a virtual memory – a very large main memory. The cache's purpose is improving performance; the purpose of virtual memory is the question taken up next.]
Virtual Memory
❑ Multiple programs (processes) share one main memory.
❑ Large programs can run on a computer with a small main memory.

[Figure (Chris Terman, MIT 6.004): several processes' code and data live in a large virtual address space (e.g., 4 GB); physical memory (e.g., 1 GB RAM) holds the active parts, with the rest on disk in a page file or swap space.]
Relocation and Address translation
❑ Programs are located and run in virtual memory.
Each program has its own contiguous address space (virtual addresses).
Virtual addresses are mapped to physical addresses via translation.
Memory is organized in pages of fixed size (4KB – 64KB).

Do programs need to be allocated in contiguous physical pages?
Example: a CPU with 32-bit addresses, but the computer has only 1GB of physical memory.
Address Translation
❑ The CPU accesses a memory location based on a virtual address: virtual page number + page offset.
❑ If the virtual page number can be translated to a physical page number (hit) → the memory access can be done properly.
❑ Otherwise (miss): page fault → a very expensive operation.
A new physical page is allocated for the running process.
- If no free physical page is available, move an "old" page to disk to make space for the new page ➔ page replacement.
Content for the new page is loaded from disk.


Page Tables
❑ Stores placement information of each program (process)
An array of Page Table Entries (PTEs), indexed by virtual page number.
Located in main memory.
A page table register in the CPU points to the page table in physical memory.
❑ If the page is present in memory
The PTE stores the physical page number,
plus status bits (referenced, dirty, …).
❑ If the page is not present
The PTE can refer to a location in swap space on disk.
A translation sketch follows below.
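A minimal C sketch of this translation, assuming 4 KB pages and a flat array of PTEs (the PTE layout and all names are illustrative):

#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define PAGE_BITS 12                      /* 4KB pages        */
#define NUM_VPAGES (1u << 20)             /* 32-bit addresses */

struct pte { bool valid; uint32_t ppn; }; /* + status bits in reality */
static struct pte page_table[NUM_VPAGES]; /* lives in main memory     */

/* Translate a virtual address; returns false on page fault. */
static bool translate(uint32_t vaddr, uint32_t *paddr) {
    uint32_t vpn    = vaddr >> PAGE_BITS;
    uint32_t offset = vaddr & ((1u << PAGE_BITS) - 1);
    if (!page_table[vpn].valid)
        return false;                     /* page fault: OS takes over */
    *paddr = (page_table[vpn].ppn << PAGE_BITS) | offset;
    return true;
}

int main(void) {
    page_table[3] = (struct pte){true, 42};      /* map virtual page 3 */
    uint32_t pa;
    if (translate((3u << PAGE_BITS) | 0x123, &pa))
        printf("paddr = 0x%08x\n", pa);          /* (42 << 12) | 0x123 */
    return 0;
}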


Translation Using a Page Table

[Figure: the page table register points to the page table in memory; the virtual page number indexes a PTE, whose physical page number is concatenated with the page offset to form the physical address.]


Page Fault Penalty and Storage Mapping
❑ On a page fault, the page must be fetched from disk
Usually together with page replacement.
Takes millions of clock cycles.
Handled by OS code.
❑ Memory pages can be stored in a disk page file or swap space
Managed by the OS.


Issues in virtual memory design
❑ Minimize the cost of page faults and data writes: minimize the page fault rate, and minimize disk write frequency
Fully associative placement.
Smart replacement algorithms.
Write-back approach.
❑ Fast address translation: this happens for every memory access, so it must be as fast as possible
Caching the page table: the Translation Look-aside Buffer (TLB).


Page Replacement and Writes
❑ Least-recently used (LRU) for page replacement
Can be quite slow when the number of pages is large.
A reference bit (aka use bit) in the PTE is set to 1 on access to the page,
and periodically cleared to 0 by the OS.
Pages with reference bit = 0 are considered for replacement.
❑ Disk writes take millions of cycles
Disk writes are slow and should be done in batches of data.
→ Write-through is impractical; use write-back.
A dirty bit in the PTE is set when the page is written.


Fast Translation Using TLB
❑ Address translation requires two consecutive memory references:
one to access the PTE, then the actual memory access.
Translation has good locality → the page table can be cached.
❑ TLB (Translation Look-aside Buffer)
A new component inside the CPU.
Provides fast access to the most recent PTEs.
Typical: 16–512 PTEs, 0.5–1 cycle for a hit, 10–100 cycles for a miss, 0.01%–1% miss rate.
Only contains PTEs corresponding to physical pages (see the sketch below).
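A hedged C sketch of a TLB lookup in front of a page-table walk (a fully associative toy TLB; the stub walk and all names are my own):

#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define TLB_ENTRIES 16
#define PAGE_BITS   12

struct tlb_entry { bool valid; uint32_t vpn, ppn; };
static struct tlb_entry tlb[TLB_ENTRIES];

/* Stub page-table walk (see the sketch on the Page Tables slide). */
static bool walk_page_table(uint32_t vpn, uint32_t *ppn) {
    (void)vpn; (void)ppn;
    return false;   /* pretend the page is not mapped */
}

/* TLB first; only on a miss fall back to the page table. */
static bool tlb_translate(uint32_t vaddr, uint32_t *paddr) {
    uint32_t vpn    = vaddr >> PAGE_BITS;
    uint32_t offset = vaddr & ((1u << PAGE_BITS) - 1);
    for (int i = 0; i < TLB_ENTRIES; i++)          /* parallel compare in HW */
        if (tlb[i].valid && tlb[i].vpn == vpn) {   /* TLB hit                */
            *paddr = (tlb[i].ppn << PAGE_BITS) | offset;
            return true;
        }
    uint32_t ppn;                                  /* TLB miss               */
    if (!walk_page_table(vpn, &ppn)) return false; /* page fault             */
    *paddr = (ppn << PAGE_BITS) | offset;          /* (real HW also refills  */
    return true;                                   /*  the TLB here)         */
}

int main(void) {
    tlb[0] = (struct tlb_entry){true, 5, 99};
    uint32_t pa;
    printf("hit=%d\n", tlb_translate((5u << PAGE_BITS) | 0x10, &pa));
    return 0;
}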


Fast Translation Using a TLB

[Figure: the virtual page number is looked up in the TLB first; on a TLB hit the physical page number comes straight from the TLB, on a miss it comes from the page table in memory.]


TLB Miss Handler
❑ A TLB miss indicates either:
the page is present, but the PTE is not in the TLB; or
the page is not present.
❑ Page present:
The handler copies the PTE from memory to the TLB,
then restarts the instruction.
❑ If the page is not present: a page fault will occur.


Page Fault Handler
❑ Use the faulting virtual address to find the PTE (currently not valid).
❑ Locate the page on disk.
❑ Choose a page in physical memory to replace
If dirty, write back the chosen page to disk first.
❑ Read the page into memory and update the page table.
❑ Make the process runnable again
Restart from the faulting instruction.


TLB and Cache Interaction

[Figure: address flow from a virtual address through translation (TLB, then page table) to physical memory, raising the question: should the cache sit on the physical or the virtual side of translation?]

❑ Physically addressed cache
The cache uses physical addresses.
Need to translate before the cache lookup.
Slow performance.
❑ Virtually addressed cache
Skips the TLB in normal cache access.
Aliasing problem:
- different virtual addresses for a shared physical address.
❑ Compromise: virtually indexed but physically tagged
No aliases, but complicated physical design.


Process and Memory protection
❑ Process: an instance of a program in execution
(take the IT3070E OS course for more details)
With a separate (virtual) memory space.
Shares the common physical memory.
Important data of a process: PC, register values, page table.
❑ Memory must be protected
Read protection: processes are not able to read each other's memory.
Write protection: processes are prohibited from writing to other processes' memory.
❑ Super process: the OS
Memory Protection
❑ Read protection
Virtual pages of separate processes map to disjoint physical pages.
Placing page tables in the protected address space of the OS → processes are not allowed to modify page tables.
❑ Sharing data
The OS creates a page table entry for a virtual page of one process to point to a physical page of another process.
Write protection: use the write-protection bit.
❑ Hardware support for protection (used by the OS)
A special privileged supervisor mode (aka kernel mode) and privileged instructions.
Page tables and other state information only accessible in supervisor mode.
A system call exception (e.g., syscall in MIPS) to go from user mode to supervisor mode.
Summary
❑ Memory hierarchy and the locality principle
❑ Cache design
Direct mapped
Set associative
Memory access on cache hit and miss
❑ Virtual memory
Address translation
TLB
Protection
