Unit 5: Memory

Memory

Memory: Basic Understanding
Two common views of the memory system:
CPU + main memory as a big array of bytes
CPU + memory controllers/chips + I/O controllers/devices
Basic Memory Subsystem
CPU-Memory Interface:
Uni-directional address bus
Bi-directional data bus
Read, write and ready control lines
Size (byte, word) control line
Memory access: a memory bus transaction
[Figure: CPU connected to memory by the address bus, data bus and Read/Write/Ready/Size control lines; a 16x8-bit memory array with a 1-of-16 address decoder illustrates a 16x1-bit memory chip]
Memory
Memories come in many shapes, sizes and types:
RAM - random access memory
ROM - read only memory
EPROM, FLASH - electrically programmable read only memory
Source: Intel Seminar (R. Rajkumar)
Memory Types
DRAM: Dynamic Random Access Memory
Very dense (1 transistor per bit) and inexpensive
Requires refresh and often not the fastest access times
SRAM: Static Random Access Memory
Fast and no refresh required
Not so dense and not so cheap
Often used for caches
ROM/Flash: Read-Only Memory
Often used for bootstrapping
Basic Static RAM Cell
6-Transistor SRAM Cell
[Figure: back-to-back inverters form a flip-flop; access transistors connect the cell to the complementary bit and /bit lines under control of the word (row select) line]
Read:
1. Select row
2. Cell pulls one bit line low and the other high
3. Sense output on bit and /bit
Write:
1. Drive bit lines (e.g., bit=1, /bit=0)
2. Select row
Simplified SRAM Timing Diagram
Read: valid address, then Chip Select
Access time: address good to data valid (even if not visible on the output)
Cycle time: minimum time between subsequent memory operations
Write: valid address and data with WE_L, then CS
Address must be stable a setup time before WE and CS go low, and a hold time after one goes high
SRAM Read Timing (typical)
[Timing diagram: ADDR, CS, OE and DOUT waveforms with WE held HIGH; annotated with tAA, tACS, tOE, tOZ and tOH; data becomes valid after max(tAA, tACS)]
OE enables the 3-state output buffers.
SRAM Read Timing Parameters
tAA (access time for address): how long it takes to get stable output after a change in address.
tACS (access time for chip select): how long it takes to get stable output after CS is asserted.
tOE (output enable time): how long it takes for the three-state output buffers to leave the high-impedance state when OE and CS are both asserted.
More Timing Parameters
tOZ (output-disable time): how long it takes for the three-state output buffers to enter the high-impedance state after OE or CS is negated.
tOH (output-hold time): how long the output data remains valid after a change to the address inputs.
Embedded SRAM
Low density and high speed
Preferred choice for frequently accessed, time-critical storage
Cache and register files
Power is a serious concern
Static power due to leakage
Dynamic power due to switching of long and heavily loaded bit and word lines and of the sense amplifiers in read-out circuits
Multi-ported Memory
Motivation: consider the CPU core register file.
One read or write per cycle limits processor performance.
It complicates pipelining: it is difficult for different instructions to simultaneously read or write the register file/on-chip memory.
A common arrangement in pipelined CPUs is 2 read ports and 1 write port.
Dual-ported Memory Internals
Add a second decoder, another set of read/write logic, and extra bit lines and word lines.
[Figure: cell array with two word lines (WL1, WL2) per row, two decoders (deca, decb), duplicated bit-line pairs (b1, /b1, b2, /b2), and two sets of read/write logic, giving two independent address and data ports]
Dynamic RAM
SRAM cells exhibit high speed but poor density.
DRAM: simple transistor/capacitor pairs in high-density form.
Requires refresh at regular intervals.
[Figure: DRAM cell with one access transistor controlled by the word line, a storage capacitor C connected to the bit line, and a sense amplifier reading the bit line]
DRAM Organization
A d x w DRAM stores dw total bits, organized as d supercells of size w bits.
[Figure: 16 x 8 DRAM chip; the memory controller sends a 2-bit address selecting one of 4 rows and one of 4 columns, supercell (2,1) is highlighted, and 8 data bits pass to/from the CPU via an internal row buffer]
Reading DRAM Supercell (2,1)
Step 1: Row access strobe (RAS) selects row 2.
Step 2: Row 2 is copied from the DRAM array to the row buffer.
[Figure: 16 x 8 DRAM chip with RAS = 2 on the address lines; row 2 is transferred into the internal row buffer]
Reading DRAM Supercell (2,1), continued
Step 3: Column access strobe (CAS) selects column 1.
Step 4: Supercell (2,1) is copied from the row buffer to the data lines, and eventually back to the CPU.
[Figure: 16 x 8 DRAM chip with CAS = 1 on the address lines; supercell (2,1) leaves the internal row buffer on the 8-bit data lines to the CPU]
(A behavioral sketch of this two-step read follows.)
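The four steps above are easy to capture as a behavioral model. Below is a minimal C sketch for the 16x8 chip, assuming hypothetical names (cells, row_buffer, ras, cas); it models only the data movement, not the analog strobes or timing.

/* Behavioral sketch of the two-phase DRAM read above for a 16x8 chip:
   ras copies a whole row into the row buffer, cas picks one supercell. */
#include <stdint.h>

#define ROWS 4
#define COLS 4

static uint8_t cells[ROWS][COLS];        /* 16 supercells of 8 bits     */
static uint8_t row_buffer[COLS];
static int open_row = -1;

void ras(int row)                        /* steps 1-2: row -> buffer    */
{
    for (int c = 0; c < COLS; c++)
        row_buffer[c] = cells[row][c];
    open_row = row;
}

uint8_t cas(int col)                     /* steps 3-4: buffer -> data lines */
{
    return row_buffer[col];
}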
Memory Modules
[Figure: a 64 MB memory module consisting of eight 8Mx8 DRAMs (DRAM 0 to DRAM 7). The address (row = i, col = j) is broadcast to all eight chips; each chip supplies one byte (bits 0-7, 8-15, ..., 56-63), and the memory controller assembles them into the 64-bit doubleword at main memory address A]
DRAM Timing Parameters
tRAC: minimum time from the RAS line falling to valid data output.
Quoted as the speed of a DRAM when buying; a typical 4 Mbit DRAM has tRAC = 60 ns.
tRC: minimum time from the start of one row access to the start of the next.
tRC = 110 ns for a 4 Mbit DRAM with a tRAC of 60 ns.
More Timing Parameters
tCAC: minimum time from the CAS line falling to valid data output.
15 ns for a 4 Mbit DRAM with a tRAC of 60 ns.
tPC: minimum time from the start of one column access to the start of the next.
35 ns for a 4 Mbit DRAM with a tRAC of 60 ns.
Enhanced DRAMs
All enhanced DRAMs are built around the conventional DRAM core.
Synchronous DRAM (SDRAM)
Driven by the rising clock edge instead of asynchronous control signals.
SDRAM is tied to the system clock and is designed to read or write memory in burst mode (after the initial read or write latency) at 1 clock cycle per access (zero wait states).
Double data-rate synchronous DRAM (DDR SDRAM)
Enhancement of SDRAM that uses both clock edges as control signals.
Embedded DRAM
Provides high-density storage: up to 10 times the density of SRAM.
Significantly slower than SRAM.
Requires a dedicated process for on-chip fabrication.
Not well compatible with standard CMOS technology for logic implementation.
Non-volatile Memory
Mask ROM
Used for dedicated functionality.
Contents fixed at IC fabrication time (truly write-once!).
EPROM (erasable programmable ROM)
Requires a special IC process (floating-gate technology).
Writing is slower than RAM; EPROM needs a special programming system to provide special voltages and timing.
Reading can be made fairly fast.
Rewriting is slow: erasure is required first (EPROM: UV light exposure; EEPROM: electrically erasable).
Flash: Floating-Gate MOS
The floating gate is surrounded by silicon dioxide, an excellent insulator.
By controlling the terminal voltage, the floating gate can be charged electrically.
Source: L. Benini et al., ACM Trans., Feb 2003
EEPROM
Erased using a higher-than-normal voltage.
Can be erased word by word, not only in its entirety.
In-circuit programmable.
Reads in tens of nanoseconds; writes in tens of microseconds.
Flash Memory
• Uses a single transistor per bit (EEPROM employs two transistors)
• Provides high-density storage with speed marginally less than that of SRAMs
• Write time is significantly higher compared to DRAM
FLASH Memory
Electrically erasable.
In-system programmability and erasability (no special system or voltages needed).
On-chip circuitry and voltage generators control erasure and programming (writing).
Erasure happens in variable-sized "sectors" (16 KB - 64 KB).
Flash: NAND vs. NOR
NAND
Compact flash cards use NAND flash.
Small chip size.
Fast burst-mode access.
NOR
Microcontrollers usually use NOR flash.
Fast random access.
Source: Toshiba presentation: What is NAND flash memory, March 2003
Embedded Non-volatile Storage
On-chip non-volatile storage is used for:
Configuration information
Executable code that runs on core processors
Higher read bandwidths and lower pin-out requirements
Application-specific tailoring of bit-width and memory size
Security of proprietary code
Recorded data: repeated writes
Current FGMOS-based memories can withstand more than 10^6 rewrites.
Interfacing Memory: ARM System Architecture
We need a mechanism to interface with and access memory units.
ARM7 Memory Interface Signals
32-bit address A[31:0]
32-bit bi-directional data D[31:0]
Separate data in and out: Din[31:0] and Dout[31:0]
mreq and seq for requesting memory access
r/w for read/write indication
mas[1:0] for data size identification: word 10, half-word 01, byte 00
All activity is controlled by mclk
Simple Memory Interface
• 4 SRAMs (SRAM size: 2^n x 32): write enabled separately, read enabled together
• 4 ROMs (ROM size: 2^m x 32): no write enable, read enabled together
Simple Memory Decoder
Controls the activation of RAM and ROM:
A[31] = 0 selects ROM; A[31] = 1 selects RAM.
Controls the byte write enables (RAMwe0-RAMwe3) during writes, using mas[1:0] (00 byte, 01 half-word, 10 word) and A[1:0].
Drives RAMoe/ROMoe for reads, based on r/w.
Ensures that data is ready before the processor continues.
[Figure: decoder logic clocked by mclk, with inputs mas[1:0], A[0], A[1], r/w and A[31], producing RAMwe0-RAMwe3, RAMoe and ROMoe]
(A C sketch of the decode logic follows.)
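To make the decoder's behaviour concrete, here is a minimal C model of the logic just described. It is a sketch under stated assumptions: all names are illustrative, clocking and the wait-state logic are ignored, and the mas[1:0] encodings follow the slide (00 byte, 01 half-word, 10 word).

/* Minimal C model of the byte-write-enable decoder described above.
   Signal and field names are illustrative, not from a specific part. */
#include <stdint.h>
#include <stdbool.h>

typedef struct {
    bool ram_we[4];   /* per-byte RAM write enables */
    bool ram_oe;      /* RAM output enable (read)   */
    bool rom_oe;      /* ROM output enable (read)   */
} decode_out;

decode_out decode(uint32_t addr, unsigned mas, bool write)
{
    decode_out o = {0};
    bool is_ram = (addr >> 31) & 1;          /* A[31]=1 -> RAM, 0 -> ROM */

    if (!write) {                            /* reads enable a whole bank */
        o.ram_oe = is_ram;
        o.rom_oe = !is_ram;
        return o;
    }
    if (!is_ram)                             /* ROM has no write enable  */
        return o;

    switch (mas & 3) {                       /* mas[1:0]: transfer size  */
    case 0:                                  /* byte: one lane from A[1:0] */
        o.ram_we[addr & 3] = true;
        break;
    case 1:                                  /* half-word: two lanes     */
        o.ram_we[addr & 2] = o.ram_we[(addr & 2) + 1] = true;
        break;
    case 2:                                  /* word: all four lanes     */
        for (int i = 0; i < 4; i++) o.ram_we[i] = true;
        break;
    }
    return o;
}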
SRAM/ROM Memory Timing
The address should be stable during the falling edge of output enable.
SRAM is fast; ROM is slow.
ROM needs more time and slows the system: use wait states, at the cost of more complex control.
[Timing diagram: mclk, addresses A, B, C on A[31:0], and RAMoe]
ROM Wait Control State Transition
Example: ROM access requires 4 clock cycles; RAM access is faster.
[State diagram: from reset, a fast RAM state and a ROM path through wait states ROM1, ROM2, ROM3]
Timing Diagram for ROM Wait States
[Timing diagram: mclk, A[31:0], wait and ROMoe; one fast cycle followed by wait states ROM1, ROM2, ROM3]
Operation
Processor internal operation cycles do not need access to memory.
Memory access is much slower than internal operations: use wait states for memory accesses, so that internal operations can run at maximum speed.
mreq = 1: internal operation
mreq = 0: memory access
[State diagram: from reset, decode branches to a RAM state or to the ROM path through wait states ROM1, ROM2, ROM3]
Reviewing DRAM Organisation
First the row address is presented on A[n:0] and latched by the ras signal; the row decoder activates one row of the array of memory cells.
Next the column address is presented and latched by the cas signal; a multiplexer selects the data out.
[Figure: address latches driven by ras and cas, row decoder, array of memory cells, column mux and data out]
Making DRAM Access Fast
Accessing data in the same row using a cas-only access is 2-3 times faster.
A cas-only access does not activate the cell matrix.
If the next access is within the same row, a new column address may be presented just by applying a cas-only access.
This exploits the fact that most processor addresses are sequential.
DRAM Access
If we had a way of knowing that the next address is sequential with respect to the current address (current address + 4), then we could assert only cas and make the DRAM access fast.
Difficulty? Detecting, early in the memory access cycle, that the next address is in the same row. (See the sketch below.)
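A minimal sketch of that check, assuming a hypothetical ROW_SHIFT constant for the device's row size: the next access can use a cas-only cycle when it is prev + 4 and its row-address bits are unchanged.

/* Sketch: decide whether the next access can use a cas-only cycle.
   ROW_SHIFT is a placeholder for log2(row size in bytes) of the DRAM. */
#include <stdint.h>
#include <stdbool.h>

#define ROW_SHIFT 11   /* e.g. 2 KB rows; device-specific assumption */

static bool same_row(uint32_t prev, uint32_t next)
{
    /* Sequential (prev + 4) and not crossing a row boundary:
       identical row-address bits mean only cas needs to be asserted. */
    return (next == prev + 4) && ((prev >> ROW_SHIFT) == (next >> ROW_SHIFT));
}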
ARM Solution to cas-only Access
• Sequential addresses are flagged by the seq signal
• The external memory device checks the previous address and row boundaries to issue cas-only or a ras-cas combination
Revised State Transition Diagram
seq = 1: sequential address
seq = 0: non-sequential address
mreq = 1: internal operation
mreq = 0: memory access
[State diagram: from reset, decode branches on mreq and seq to a RAM state or to the ROM path through wait states ROM1, ROM2, ROM3]
DRAM Timing Diagram
Notice the pipelined memory access: the address is presented 1/2 cycle early.
[Timing diagram: mclk, addresses A, A+4, A+8 on A[31:0], seq, wait, ras, cas and D[31:0]; an N cycle (new address) followed by two S cycles (sequential addresses)]
Summary
We have learnt about different types of memory and their characteristics.
We have seen how external memory can be interfaced in an ARM7-based system.
We shall study memory organisation and use in future classes.
Memory Organisation

Memory-Centric View of an Embedded System
[Figure: CPU with an L1 cache and a scratch pad memory (SPM) on chip, connected to off-chip memory]
Memory Organization
The memory system starts with the register file.
A cache or caches feed data and instructions to the pipeline.
Most embedded systems use a one-level cache.
The main memory system may be contained partially on-chip and partially off-chip.
Scratch pad memories have been proposed as one form of high-speed on-chip memory.
Off-chip, a variety of technologies may be used, including SDRAM.
Partitioning of Data Memory
[Figure: data memory partitioned between on-chip SPM/cache and off-chip memory, annotated with a 10-20 cycle access cost]
Ref: Panda et al., ACM Trans. ADES, 2001
Scratch Pad Memory
SPM is data memory residing on-chip, mapped into an address space disjoint from off-chip memory.
Both cache and SPM allow fast access to data, whereas off-chip memory requires relatively longer access times.
SPM guarantees single-cycle access, whereas an access to cache is subject to cache misses.
A cache dynamically maps addresses from a larger, slower memory.
Cache
A cache is a small, fast memory that holds copies of some of the contents of main memory.
It exploits:
Temporal locality: the same data/instruction is likely to be accessed again soon.
Spatial locality: data/instructions are likely to be accessed from the neighbourhood of the current access.
A cache controller mediates between the CPU and main memory.
Caches and CPUs
[Figure: the CPU issues address and data to the cache controller, which sits between the cache and main memory]
The cache controller uses different portions of the address issued by the processor during a memory request to select parts of the cache memory.
ACK: COMPUTERS AS COMPONENTS, WAYNE WOLF
Definitions
Working set: the set of memory locations the CPU refers to at any one time.
Cache hit: an address requested by the CPU is found in the cache.
Cache miss: the location is not in the cache.
Compulsory miss (cold miss): occurs the first time a location is used.
Capacity miss: caused by a too-large working set.
Conflict miss: two locations map to the same location in the cache.
Cache Operation
A miss causes the cache controller to copy the data from main memory to the cache.
Data is forwarded to the CPU at the same time: data streaming.
Data occupying the cache block is evicted and replaced by the contents of the memory addresses requested by the CPU.
Dirty bit: a status bit indicating whether the cache content has changed.
Direct-Mapped Cache
In a direct-mapped cache, each memory address is associated with one possible block within the cache.
So we look in a single location in the cache for the data, if it exists in the cache.
A block is the unit of transfer between cache and memory.
Direct-Mapped Cache (2)
[Figure: a 4-byte direct-mapped cache (indexes 0-3) mapped against a 16-location memory (0-F)]
Block size = 1 byte.
Cache location 0 can be occupied by data from memory locations 0, 4, 8, ...
In general: a fixed mapping from memory locations to cache locations.
Direct-mapped Cache Lookup
Each cache block holds a valid bit, a tag (e.g., 0xabcd) and the data bytes.
The address splits into tag, index and offset: the index selects the cache block to check, the stored tag is compared with the address tag to produce the hit value, and the offset selects the byte.
Two regularly used memory locations mapping to the same cache location lead to conflict misses. (See the sketch below.)
ACK: COMPUTERS AS COMPONENTS, WAYNE WOLF
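As an illustration of the tag/index/offset split, here is a minimal direct-mapped lookup in C. Sizes and names are illustrative, not from any particular part.

/* Direct-mapped lookup sketch: split the address into tag/index/offset
   and check one block. Sizes are illustrative (1 KB cache, 16 B blocks). */
#include <stdint.h>
#include <stdbool.h>

#define BLOCK_BITS 4                     /* 16-byte blocks   */
#define INDEX_BITS 6                     /* 64 blocks        */
#define NBLOCKS    (1u << INDEX_BITS)

struct line { bool valid; uint32_t tag; uint8_t data[1 << BLOCK_BITS]; };
static struct line cache[NBLOCKS];

bool lookup(uint32_t addr, uint8_t *byte)
{
    uint32_t index = (addr >> BLOCK_BITS) & (NBLOCKS - 1);
    uint32_t tag   = addr >> (BLOCK_BITS + INDEX_BITS);
    struct line *l = &cache[index];

    if (l->valid && l->tag == tag) {     /* hit */
        *byte = l->data[addr & ((1u << BLOCK_BITS) - 1)];
        return true;
    }
    return false;                        /* miss: controller must refill */
}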
Direct-mapped Cache Organization
[Figure: the address indexes the tag RAM and data RAM in parallel; a comparator checks the stored tag against the address tag to generate hit, and a mux selects the data]
Source: Steve Furber: ARM System On-Chip; 2nd Ed, Addison-Wesley, 2000.
Set-associative Cache
Consists of a number of sets, and is characterized by the number of sets it uses.
Each set is implemented as a direct-mapped cache.
Memory locations map onto blocks as in a direct-mapped cache, but there are n separate blocks for each memory location.
Each request is broadcast to all sets simultaneously; if any of the sets has the location, the cache reports a hit. (See the sketch below.)
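A minimal C sketch of the broadcast lookup, treating the cache as n parallel direct-mapped "sets" as the slide does; sizes and names are illustrative.

/* n-way set-associative lookup sketch: probe the same index in every way.
   Hardware does all probes in parallel; the loop stands in for that. */
#include <stdint.h>
#include <stdbool.h>

#define BLK_BITS 4                       /* 16-byte blocks     */
#define IDX_BITS 6                       /* 64 indexes per way */
#define NWAYS    2                       /* associativity      */
#define NIDX     (1u << IDX_BITS)

struct way_line { bool valid; uint32_t tag; uint8_t data[1 << BLK_BITS]; };
static struct way_line ways[NWAYS][NIDX];

bool sa_lookup(uint32_t addr, uint8_t *byte)
{
    uint32_t index = (addr >> BLK_BITS) & (NIDX - 1);
    uint32_t tag   = addr >> (BLK_BITS + IDX_BITS);

    for (int w = 0; w < NWAYS; w++) {    /* every set checked "simultaneously" */
        struct way_line *l = &ways[w][index];
        if (l->valid && l->tag == tag) {
            *byte = l->data[addr & ((1u << BLK_BITS) - 1)];
            return true;                 /* some set reported a hit */
        }
    }
    return false;
}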
Set-associative Cache Structure
[Figure: a set of direct-mapped caches, Set 1, Set 2, ..., Set n, probed in parallel; hit and data come from whichever set matches]
Lower miss rate, but slower than a direct-mapped cache.
ACK: COMPUTERS AS COMPONENTS, WAYNE WOLF
2-way Set-associative Cache Organization
[Figure: two tag RAM / data RAM pairs are indexed by the address in parallel; each pair has its own comparator and mux, and their hit/data outputs are combined]
Source: Steve Furber: ARM System On-Chip; 2nd Ed, Addison-Wesley, 2000.
CAM
CAM: content-addressable memory.
Uses a set of comparators to compare the input tag address with the cache-tag stored in each valid cache block.
A CAM produces an address if a given data value exists in memory.
CAM enables many more cache-tags to be compared simultaneously.
Fully Associative Cache Organization
[Figure: a tag CAM searches all tags at once; on a match, a mux selects the data from the data RAM and hit is asserted]
CAM is used in the ARM920T and ARM940T.
Source: Steve Furber: ARM System On-Chip; 2nd Ed, Addison-Wesley, 2000.
Cache in ARM
Von Neumann architecture:
Unified cache: a single cache for instructions and data.
Harvard architecture:
Split cache: two caches, an instruction cache and a data cache.
FF..FF16

registers
instructions

processor

instructions
address and data
data
copies of
instructions address
copies of
data
memory
cache
instructions 00..0016
and data

Source: Steve Furber: ARM System On-Chip; 2nd Ed, Addison-Wesley, 2000.
Separate Data and Instruction Caches
[Figure: the processor fetches instructions through an instruction cache (copies of instructions) and accesses data through a separate data cache (copies of data), each with its own address and data paths, both backed by the same memory spanning 00..00 to FF..FF hex]
Source: Steve Furber: ARM System On-Chip; 2nd Ed, Addison-Wesley, 2000.
Write Buffer
A small, fast FIFO that temporarily holds data that the processor would write to main memory.
The buffer is emptied to the slower main memory.
Reduces the time for the processor to write small blocks of sequential data to main memory.
Write buffers improve cache performance: during block eviction, the cache controller writes a dirty block to the write buffer instead of main memory. (See the sketch below.)
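A minimal C sketch of such a buffer, assuming a hypothetical 8-entry ring: the CPU-side push is fast, and a drain routine empties entries to the slow memory.

/* Write-buffer sketch: a small FIFO between CPU and main memory.
   Depth, names and types are illustrative. */
#include <stdint.h>
#include <stdbool.h>

#define DEPTH 8

struct wb_entry { uint32_t addr, data; };
static struct wb_entry fifo[DEPTH];
static unsigned head, tail, count;

bool wb_push(uint32_t addr, uint32_t data)   /* CPU side: fast */
{
    if (count == DEPTH)
        return false;                        /* full: CPU must stall */
    fifo[tail] = (struct wb_entry){ addr, data };
    tail = (tail + 1) % DEPTH;
    count++;
    return true;
}

bool wb_drain_one(void)                      /* memory side: slow */
{
    if (count == 0)
        return false;
    /* memory_write(fifo[head].addr, fifo[head].data); -- slow bus access */
    head = (head + 1) % DEPTH;
    count--;
    return true;
}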
Write Buffer Operation
[Figure: the CPU makes fast word/byte accesses to the cache and write buffer; the write buffer performs block transfers to main memory, where direct word/byte access is slow]
Some write buffers are not strictly FIFO; the ARM10 family supports coalescing.
Cache Policies
Write policy:
Writethrough: the cache controller writes to both cache and memory.
Writeback: the controller writes only to the cache and sets the dirty bit true.
Better cache performance, at the risk of data inconsistency over a longer period.
Cache Policies (2)
Block replacement policy:
Round-robin (cyclic) replacement: predictable performance.
Random selection: improves worst-case behaviour.
Allocation policy:
Read allocate: allocates only on a read.
Read-write allocate: allocates on either a read or a write.
ARM Cached Core Policies

Core      Write Policy             Replacement Policy    Allocation Policy
ARM720T   writethrough             random                read-miss
ARM740T   writethrough             random                read-miss
ARM920T   writethrough/writeback   random, round-robin   read-miss
ARM946E   writethrough/writeback   random, round-robin   read-miss
Cache Control in ARM
All standard memory facilities are controlled by the System Control Coprocessor (CP15).
CP15 has registers through which cache features and control functions are specified:
Size of cache and degree of associativity
Enabling/disabling of cache operations
Policy choices: write, replacement, allocation
Memory areas can be marked as cachable or not.
Memory-mapped I/O locations are not cachable or bufferable (use of the write buffer is not possible).
Cache Lockdown
Cache lockdown allows critical code and data to be loaded into the cache in such a way that the corresponding cache blocks are not re-allocated.
Example: high-priority interrupt routines and the data they access.
For lockdown purposes the cache in ARM is divided into lockdown blocks, one block from each cache set.
Multi-level Cache
To minimize the cache miss rate due to limitations in capacity, system designers may add a level 2 (L2) cache.
The L1 cache will normally have single-cycle access; the L2 cache will have a latency of more than one CPU cycle, but less than that of main system memory.
Hierarchical Memory Architecture
Optimizes data transfer rate and energy consumption.
The energy required for a DRAM access is more than 10 times that of an L2 cache access.
Source: L. Benini et al., ACM Trans., Feb 2003
Energy-Aware Memory Organization
Memory is partitioned into blocks that can be enabled/disabled independently.
The energy per memory access is lower when memory banks are small.
But a large number of small banks is area-inefficient and imposes wiring overhead.
Partitioned Memory Architecture
[Figure: memory partitioned into independently enabled banks]
Source: L. Benini et al., ACM Trans., Feb 2003
Summary
We have studied the organization of memory in embedded systems.
We have looked at the use of cache.
Next we shall discuss memory management and use: virtual memory and protection.
Virtual Memory and the Memory Management Unit

Virtual Memory
Use physical DRAM as a cache for secondary storage:
The address space of a process can exceed physical memory size.
The sum of the address spaces of multiple processes can exceed physical memory.
Simplify memory management:
Multiple processes resident in main memory, each with its own address space.
Only "active" code and data is actually in memory.
Allocate more memory to a process as needed.
Virtual Memory (2)
Provide protection:
One process can't interfere with another, because they operate in different address spaces.
A user process cannot access privileged information: different sections of address spaces have different permissions.
Memory Management Unit
The Memory Management Unit provides the key service of managing tasks as independent programs running in their own private address space.
The MMU simplifies programming of application tasks by providing resources for managing virtual memory.
The MMU acts as a translator, converting addresses of programs and data into actual physical addresses.
This enables relocation of programs with virtual addresses to any part of memory.
Memory Management Unit (2)
[Figure: the CPU issues a logical address; the memory management unit translates it to a physical address for main memory]
Segmentation-based Memory Management
Segmentation is provided by a simple MMU.
A program views its memory as a set of segments: code segment, data segment, stack segment, etc.
Each program has its own set of private segments.
Each access to memory is via a segment selector and an offset within the segment.
This allows a program to have its own private view of memory and to coexist transparently with other programs in the same memory space.
Segment-based Address Generation
[Figure: the segment selector indexes the Segment Descriptor Table (SDT) to fetch base and bound; the physical address is base + logical address, and an access fault is raised if the offset exceeds the bound]
Base: the base address of the segment.
Logical address: an offset within a segment.
Bound: the segment limit.
SDT: holds access rights and other information about the segment.
(A translation sketch follows.)
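A minimal C sketch of this translation, with illustrative names: physical address = base + offset, with an access fault when the offset exceeds the bound.

/* Segment translation sketch matching the figure above.
   Table size and field names are illustrative. */
#include <stdint.h>
#include <stdbool.h>

typedef struct { uint32_t base, bound; } seg_desc;

static seg_desc sdt[16];                  /* segment descriptor table */

bool seg_translate(unsigned sel, uint32_t offset, uint32_t *phys)
{
    seg_desc *d = &sdt[sel];
    if (offset > d->bound)                /* access fault */
        return false;
    *phys = d->base + offset;
    return true;
}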
Paging
The logical memory area can be bigger than the physical memory.
Logical memory can be accommodated in secondary memory.
Divide physical memory into equal-sized chunks.
Any chunk of virtual memory can be assigned to any chunk of physical memory (a "page").
Using Paged Memory
[Figure: the CPU issues virtual addresses 0 to N-1; a page table maps them to physical addresses 0 to P-1 in memory, with additional storage backing the rest]
Address translation: hardware converts virtual addresses to physical addresses via an OS-managed lookup table (the page table).
Page Faults
What if an address is not in memory?
The page table entry indicates that the virtual address is not in memory.
An OS exception handler is invoked to move data from disk or other secondary storage into memory.
The current process suspends; others can resume.
The OS has full control over placement, etc.
[Figure: before the fault, the page table entry points to disk; after the fault, the page has been brought into memory and the entry points to a physical address]
Servicing a Page Fault
(1) Initiate block read: the processor signals the I/O controller to read a block of length P starting at disk address X and store it starting at memory address Y.
(2) DMA transfer: the read occurs as a direct memory access (DMA) under control of the I/O controller, from disk to memory over the memory-I/O bus.
(3) Read done: the I/O controller signals completion by interrupting the processor; the OS resumes the suspended process.
Managing Multiple Processes
Each process has its own virtual address space.
The operating system controls how virtual pages are assigned to physical memory.
There is a page table for each process.
Every program can start at the same (virtual) address!
A process should not access pages not allocated to it.
Protection
Each page table entry contains access rights information. Hardware enforces this protection (a violation traps into the OS).

Process i:  VP 0: Read yes, Write no   ->  PP 9
            VP 1: Read yes, Write yes  ->  PP 4
            VP 2: Read no,  Write no       (not mapped)

Process j:  VP 0: Read yes, Write yes  ->  PP 6
            VP 1: Read yes, Write no   ->  PP 9
            VP 2: Read no,  Write no       (not mapped)
Page Address Translation
[Figure: the virtual address splits into page number and page offset; the page number selects the page i base, which is concatenated with the page offset to form the physical address]
Address Translation via Page Table
[Figure: the page table base register points to the page table; the virtual page number (VPN), bits n-1..p of the virtual address, acts as the table index. The entry supplies valid and access bits and the physical page number (PPN); the PPN is combined with the page offset (bits p-1..0) to form the physical address. If valid = 0, the page is not in memory]
Page Table Operations
Translation:
A separate (set of) page table(s) per process.
The VPN forms an index into the page table (points to a page table entry).
Computing the physical address:
The page table entry (PTE) provides information about the page.
If valid bit = 1, the page is in memory: use the physical page number (PPN) to construct the address.
If valid bit = 0, the page is on disk: page fault; the page must be loaded from disk into main memory before continuing. (A translation sketch follows.)
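Here is the translation sketch: a minimal one-level walk in C, assuming 4 KB pages and a 32-bit address space; names are illustrative. Note that the flat table itself costs 2^20 entries x 4 bytes = 4 MB, which motivates the multi-level tables discussed later.

/* One-level translation sketch matching the steps above: the VPN indexes
   the page table; a valid PTE yields the PPN, else a page fault. */
#include <stdint.h>

#define PAGE_BITS 12
#define NPAGES    (1u << 20)              /* 32-bit space / 4 KB pages */

typedef struct { unsigned valid : 1, ppn : 20; } pte_t;
static pte_t page_table[NPAGES];          /* flat table: 4 MB, see later slide */

int translate(uint32_t va, uint32_t *pa)
{
    uint32_t vpn = va >> PAGE_BITS;
    pte_t pte = page_table[vpn];

    if (!pte.valid)
        return -1;                        /* page fault: OS loads from disk */
    *pa = ((uint32_t)pte.ppn << PAGE_BITS) | (va & ((1u << PAGE_BITS) - 1));
    return 0;
}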
Page Table Operations (2)
Checking protection:
The access rights field indicates the allowable access, e.g., read-only, read-write, execute-only.
Typically multiple protection modes are supported (e.g., kernel vs. user).
A protection violation fault is raised if the process doesn't have the necessary permission.
Integrating VM and Cache
[Figure: the CPU sends a virtual address (VA) to translation; the physical address (PA) probes the cache, and a miss goes to main memory]
Most caches are "physically addressed":
Accessed by physical addresses.
Allows multiple processes to have blocks in the cache at the same time and to share pages.
The cache doesn't need to be concerned with protection issues: access rights are checked as part of address translation.
VM and Cache
Address translation is performed before the cache lookup.
But translation could involve a memory access itself (of the PTE).
Page table entries can themselves be cached.
Speeding up Translation with a TLB
A Translation Lookaside Buffer (TLB) is a small hardware cache in the MMU.
It maps virtual page numbers to physical page numbers.
It contains complete page table entries for a small number of pages. (A lookup sketch follows.)
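The lookup sketch, in C: a small fully associative table of VPN-to-PPN translations consulted before the page-table walk. The linear scan stands in for what hardware does with parallel comparators; sizes and names are illustrative.

/* TLB sketch: a small cache of VPN->PPN translations consulted before
   the page-table walk. Sizes are illustrative. */
#include <stdint.h>
#include <stdbool.h>

#define TLB_ENTRIES 8
#define PAGE_BITS   12

struct tlb_entry { bool valid; uint32_t vpn, ppn; };
static struct tlb_entry tlb[TLB_ENTRIES];

bool tlb_lookup(uint32_t va, uint32_t *pa)
{
    uint32_t vpn = va >> PAGE_BITS;
    for (int i = 0; i < TLB_ENTRIES; i++) {     /* parallel in hardware */
        if (tlb[i].valid && tlb[i].vpn == vpn) {
            *pa = (tlb[i].ppn << PAGE_BITS) | (va & ((1u << PAGE_BITS) - 1));
            return true;                        /* TLB hit */
        }
    }
    return false;    /* TLB miss: fall back to the page-table walk */
}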
TLB Operation
[Figure: the CPU sends the VA to the TLB lookup; a hit yields the PA directly for the cache access, while a miss falls back to translation via the page table, after which the cache and main memory are accessed as usual]
Address Translation with a TLB
[Figure: the virtual page number (bits n-1..p of the virtual address) is compared against the valid/tag fields of the TLB entries; a TLB hit supplies the physical page number, which is combined with the page offset to form the physical address. The physical address then splits into tag, index and byte offset for the cache lookup, producing a cache hit and data]
Multi-Level Page Tables
Given: 4 KB (2^12) page size, a 32-bit address space, 4-byte PTEs.
Problem: a single-level table would need 2^20 entries x 4 bytes = 4 MB.
Common solution: multi-level page tables, e.g., a 2-level table:
Level 1 table: 1024 entries, each of which points to a Level 2 page table.
Level 2 table: 1024 entries, each of which points to a page. (A walk sketch follows.)
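The walk sketch, in C, for the 2-level layout above (10 + 10 + 12 address bits); names are illustrative and permission checks are omitted.

/* Two-level walk sketch: 10 bits index the level-1 table, 10 bits the
   level-2 table, 12 bits are the page offset. */
#include <stdint.h>
#include <stddef.h>

typedef struct { unsigned valid : 1, ppn : 20; } pte_t;
typedef struct { pte_t entry[1024]; } l2_table;

static l2_table *l1_table[1024];          /* NULL = no L2 table allocated */

int walk(uint32_t va, uint32_t *pa)
{
    uint32_t l1 = va >> 22;               /* top 10 bits    */
    uint32_t l2 = (va >> 12) & 0x3ff;     /* middle 10 bits */

    if (l1_table[l1] == NULL)
        return -1;                        /* whole 4 MB region unmapped */
    pte_t pte = l1_table[l1]->entry[l2];
    if (!pte.valid)
        return -1;                        /* page fault */
    *pa = ((uint32_t)pte.ppn << 12) | (va & 0xfff);
    return 0;
}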
MMU in ARM
The ARM MMU performs several tasks:
Translates virtual addresses into physical addresses.
Controls memory access permissions.
Determines the behaviour of the cache and write buffer for each page in memory.
When the MMU is disabled, all virtual addresses map one-to-one to the same physical addresses.
The MMU will only abort on translation, permission and domain faults.
Control Components of the MMU
The configuration and control components in the MMU are:
Page tables
TLB
Domains and access permissions
Caches and write buffer
The CP15:c1 control register
The fast context switch extension
Page Tables
The ARM MMU has a multilevel page table architecture with two levels, L1 and L2:
The master L1 page table divides the 4 GB space into 1 MB sections.
L2 page tables can support 1 KB, 4 KB or 64 KB pages.
The CP15:c2 register holds the translation table base address: an address pointing to the location of the master L1 table in virtual memory.
Translation Lookaside Buffer
The TLB is a special cache of recently used page translations.
The TLB maps a virtual page to an active page and stores control data restricting access to the page.
The ARM920T, ARM922T, ARM926EJ-S, ARM1022E and ARM1026EJ-S support locking translations in the TLB.
The Operation of a Translation Lookaside Buffer
[Figure: bits 31-12 of the logical address (the logical page number) are matched against the TLB entries; on a hit the corresponding physical page number replaces them, and bits 11-0 pass through unchanged to form the physical address]
Domains and Memory Access
Domains control basic access to virtual memory by isolating one area of memory from another when sharing a common virtual map.
There are 16 different domains that can be assigned to 1 MB sections of virtual memory.
Caches and Write Buffer
The MMU configures the caches and write buffer for each page in memory.
It indicates whether a page will be cached and whether the write buffer is enabled for the page.
Use of a Virtual Memory System: Example
Implementation of a static multi-tasking system running concurrent tasks:
Tasks can have overlapping virtual memory maps.
They may be located in physical memory at addresses that do not overlap.
Configure domain access and permissions in the page table to protect the system.
Demand paging is not necessarily implemented.
Multi-tasking and the MMU
During a context switch a different page table is activated, and the virtual-to-physical mapping changes.
To ensure cache coherency, the caches may need cleaning and flushing; the TLB also needs flushing.
The MMU can relocate a task without the need to move it in physical memory.
Multi-tasking and the MMU (2)
To reduce the time for a context switch, a writethrough cache policy can be followed in the ARM9.
A data cache configured as writethrough does not require cleaning.
Demand Paging
Use flash memory as the non-volatile store (the "disk" in appliances like PDAs, etc.).
Copy programs to RAM during system operation: dynamic paging with load on demand.
Use a write-back policy for the pages, because access time to flash is much higher than to RAM.
Demand Paging with NAND Flash
[Figure: an MCU with MMU demand-pages the OS and applications (APP1, APP2), plus stack and heap, into SDRAM from a NAND flash holding the OS, application images and the file system]
Source: Park et al., IEEE ISLPED, 2004
Page Replacement Policy
LRU is the most commonly used policy.
With NAND flash a different policy is used, because the write cost for evicting a dirty page is higher:
Keep dirty pages as long as possible.
Evict the least recently used clean (non-dirty) pages first, and only then dirty pages. (See the sketch below.)
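A minimal C sketch of this clean-first policy, with illustrative structures: scan the frames, prefer the least recently used clean frame, and fall back to the least recently used dirty frame only when no clean frame exists.

/* Clean-first victim selection: evicting a clean page avoids a costly
   flash write, so dirty pages are kept as long as possible. */
#include <stdbool.h>

#define NFRAMES 64

struct frame { bool dirty; unsigned last_use; };
static struct frame frames[NFRAMES];

int pick_victim(void)
{
    int best_clean = -1, best_dirty = -1;
    for (int i = 0; i < NFRAMES; i++) {
        if (!frames[i].dirty) {
            if (best_clean < 0 || frames[i].last_use < frames[best_clean].last_use)
                best_clean = i;          /* LRU among clean frames */
        } else {
            if (best_dirty < 0 || frames[i].last_use < frames[best_dirty].last_use)
                best_dirty = i;          /* LRU among dirty frames */
        }
    }
    /* Prefer a clean victim; fall back to a dirty one only if forced. */
    return best_clean >= 0 ? best_clean : best_dirty;
}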
Summary
We have studied different types of memory.
We have examined cache and virtual memory organisation.
We have understood the functional roles of the MMU and the cache controller.